Group Sparse Representation With WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion

2019
The statistical approach to voice conversion typically consists of a feature conversion module followed by a vocoder. So far, feature conversion studies have mainly focused on the conversion of spectrum. However, speaker identity is also characterized by prosodic features, such as fundamental frequency (F0) and energy contour, among others. In this paper, we study the transformation of speaker characteristics in terms of both spectrum and prosody. We propose two novel techniques that make effective use of a limited amount of source-target training data and leverage a large general speech corpus to improve voice conversion quality. First, we study phonetic sparse representation under the group sparsity mathematical formulation. We use phonetic posteriorgrams (PPGs) together with spectral and prosody features to form tandem features in the phonetic dictionary. The tandem features allow us to estimate an activation matrix that is less dependent on the source speaker, thus providing better voice conversion quality. Second, we study the use of a WaveNet vocoder that can be trained on a general speech corpus from multiple speakers and adapted on target speaker data to improve the vocoding quality. We benefit from the large general speech databases that are used to train both the PPG generator and the WaveNet vocoder. The experiments show that the proposed conversion framework outperforms traditional spectrum and prosody conversion techniques in both objective and subjective evaluations.
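
The idea of estimating an activation matrix over a tandem dictionary can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not state the objective or solver, so the code below assumes a Euclidean reconstruction cost, a group (l2,1) penalty over phonetic groups of dictionary atoms, and a heuristic multiplicative update that keeps activations non-negative. All names (`group_sparse_activations`, `convert`, `lam`, `groups`) and the assumption that all features are non-negative are hypothetical choices made for illustration.

```python
import numpy as np


def group_sparse_activations(X, A, groups, lam=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H such that X is approximately A @ H.

    X      : (d, T) non-negative tandem features of the source utterance
    A      : (d, K) non-negative source dictionary (one atom per column)
    groups : (K,) integer phonetic group label of each atom (assumed given)
    lam    : weight of the group-sparsity penalty (assumed hyperparameter)
    """
    rng = np.random.default_rng(0)
    K, T = A.shape[1], X.shape[1]
    H = rng.random((K, T)) + eps
    AtX = A.T @ X
    AtA = A.T @ A
    group_ids = np.unique(groups)
    for _ in range(n_iter):
        # Gradient of sum over groups and frames of ||H[group, t]||_2.
        G = np.zeros_like(H)
        for g in group_ids:
            idx = groups == g
            norms = np.sqrt((H[idx] ** 2).sum(axis=0, keepdims=True)) + eps
            G[idx] = H[idx] / norms
        # Heuristic multiplicative update; H stays non-negative.
        H *= AtX / (AtA @ H + lam * G + eps)
    return H


def convert(X_tandem, A_src, B_tgt, groups, **kwargs):
    """Apply source-side activations to a paired target dictionary."""
    H = group_sparse_activations(X_tandem, A_src, groups, **kwargs)
    return B_tgt @ H
```

In this sketch a tandem feature would be formed by stacking PPG and spectral (or prosody) frames, for example `np.vstack([ppg, spectrum])`, and `B_tgt` would hold the target speaker's features aligned atom by atom with `A_src`. Because the PPG part is largely speaker independent, it is plausible that the resulting activations depend less on the source speaker, which is the motivation given in the abstract.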