Voice Conversion

MLSP Project, Istanbul Technical University
Fall 2020

Selahaddin HONİ  |  İsmail Melik TÜRKER  |  İmran Çağla EYÜBOĞLU

Project Aim


This project aims to transfer the trained voice style of a famous person onto a given input voice. Follow the links for the final report and a brief presentation.


Reference Paper & Implementation


[Paper link]
Takuhiro Kaneko and Hirokazu Kameoka
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

[Implementation link]
Lei Mao's work
Voice Converter Using CycleGAN and Non-Parallel Data

[Project]
Our Work
MLSP Term Project: Voice Conversion
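The paper's key idea is CycleGAN's cycle-consistency constraint: converting source features to the target style and then back should recover the original, which removes the need for parallel (time-aligned) training data. A minimal NumPy sketch of that constraint is given below, with toy linear maps standing in for the paper's gated-CNN generators; the names, shapes, and the linear-map assumption are all illustrative, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical stand-ins for the two CycleGAN generators:
# G maps source features to target-style features, F maps them back.
# Here they are simple linear maps chosen to be exact inverses,
# so the cycle loss is (numerically) zero.
rng = np.random.default_rng(0)
W_g = np.eye(24) + 0.1 * rng.standard_normal((24, 24))
W_f = np.linalg.inv(W_g)

def G(x):  # source -> target style
    return x @ W_g

def F(y):  # target style -> source
    return y @ W_f

# 100 frames of 24-dimensional spectral features (sizes are assumed).
x = rng.standard_normal((100, 24))

# Cycle-consistency loss: L1 distance between x and F(G(x)).
cycle_loss = np.mean(np.abs(F(G(x)) - x))
print(cycle_loss)
```

In the real model, G and F are trained jointly with adversarial losses, and the cycle loss keeps the converted speech from drifting away from the input's linguistic content.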


Dataset


Source
Google's text-to-speech voices are used to generate 13 audio clips per speaker (each approx. 40 seconds long), totaling at least 8 minutes for each speaker.
Female Speaker: WaveNet Turkish Female voice G
Male Speaker: WaveNet Turkish Male voice E

Target
Similarly, 13 audio clips of Turkish news presenter Ece Uner's speech, totaling 8.8 minutes, are chosen.


The dataset is available here.
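As a quick sanity check on the sizes above, 13 clips of roughly 40 seconds each do land in the "at least 8 minutes" range; the per-clip length used here is the approximate figure from the text, not an exact measurement.

```python
# Rough total duration for one source speaker, using the approximate
# per-clip length quoted in the text (illustrative arithmetic only).
clips = 13
approx_clip_secs = 40
total_minutes = clips * approx_clip_secs / 60
print(f"approx. {total_minutes:.1f} minutes per source speaker")
```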


Result


Training Samples
The speech samples below are fed to the model during the training phase. There is no parallelism between the source and target recordings.


Source (Female) Source (Male) Target (Female)

Validation Samples
The first row of the table contains the original recording of the source speech.
The remaining rows are the outputs of models trained for the given number of epochs.


CycleGAN-VC (Female-to-Female) CycleGAN-VC (Male-to-Female)
Input
500 Epochs
1500 Epochs
5000 Epochs

Input "Merhaba, bu ses CycleGAN ile üretildi." (Hi, this voice is generated by CycleGAN)
*Of course, the input speech is not generated by this network; the outputs, however, are.


There are 3 more examples in this folder.