Group Sparse Representation With WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion
2019
The statistical approach to voice conversion typically consists of a feature conversion module followed by a vocoder. So far, feature conversion studies have mainly focused on the conversion of spectrum. However, speaker identity is also characterized by prosodic features, such as fundamental frequency (F0) and energy contour, among others. In this paper, we study the transformation of speaker characteristics in terms of both spectrum and prosody. We propose two novel techniques that effectively use a limited amount of source-target training data and leverage a large general speech corpus to improve the voice conversion quality. First, we study the phonetic sparse representation under the group sparsity mathematical formulation. We use phonetic posteriorgrams (PPGs) together with spectral and prosody features to form tandem features in the phonetic dictionary. The tandem features allow us to estimate an activation matrix that is less dependent on source speakers, thus providing better voice conversion quality. Second, we study the use of a WaveNet vocoder that can be trained on a general speech corpus from multiple speakers and adapted on target speaker data to improve the vocoding quality. We benefit from the large general speech databases that are used to train the PPG generator and the WaveNet vocoder. The experiments show that the proposed conversion framework outperforms the traditional spectrum and prosody conversion techniques in both objective and subjective evaluations.
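To make the idea of estimating an activation matrix from a tandem dictionary under group sparsity more concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes a plain least-squares fit with an L2,1-style group penalty, solved by proximal gradient descent, and it omits any non-negativity constraint. All names (`group_sparse_activations`, `convert`, `groups`, `lam`) and all feature dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def group_soft_threshold(H, groups, tau):
    """Shrink each (group, frame) block of H toward zero by tau in L2 norm."""
    out = np.zeros_like(H)
    for idx in groups:
        block = H[idx, :]                       # atoms of one group, all frames
        norms = np.linalg.norm(block, axis=0)   # one L2 norm per frame
        scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
        out[idx, :] = block * scale
    return out

def group_sparse_activations(X, A, groups, lam=0.1, n_iter=200):
    """Approximately minimise 0.5*||X - A H||_F^2 + lam * sum_{g,t} ||H[g, t]||_2.

    X: (d, T) tandem features of the source utterance (assumed layout).
    A: (d, K) tandem source dictionary, one exemplar per column.
    groups: list of index arrays partitioning the K atoms (e.g. by phone).
    """
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    H = np.zeros((A.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = A.T @ (A @ H - X)                # gradient of the least-squares term
        H = group_soft_threshold(H - step * grad, groups, step * lam)
    return H

def convert(X_src, A_src, A_tgt, groups, lam=0.1):
    """Estimate activations on the source side, then apply them to the paired target dictionary."""
    H = group_sparse_activations(X_src, A_src, groups, lam)
    return A_tgt @ H

# Toy usage with random data standing in for real features (assumption).
rng = np.random.default_rng(0)
ppg = rng.random((4, 30))                       # 4 hypothetical phonetic classes, 30 exemplars
spec = rng.random((8, 30))                      # 8-dimensional hypothetical spectral features
A_src = np.vstack([ppg, spec])                  # tandem dictionary: PPG stacked on spectrum
A_tgt = rng.random((8, 30))                     # paired target spectral exemplars
groups = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]
X_src = A_src[:, :10] @ rng.random((10, 5))     # 5 frames built from the first group only
Y = convert(X_src, A_src, A_tgt, groups, lam=0.01)
```

In this sketch the group penalty switches whole groups of atoms on or off per frame, which is one common way to encode the assumption that a frame should be explained by exemplars of a few phonetic classes. How the paper defines its groups and solves the optimisation is not stated in the abstract.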