Skip to the content.

MAIN-VC

Abstract

One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulty in satisfactory speech representation disentanglement and suffer from sizable networks. We propose a method to effectively disentangle with a concise neural network. Our model learns clean speech representations via siamese encoders with the enhancement of the designed mutual information estimator. The siamese structure and the newly designed convolution module contribute to the lightweight of our model while ensuring the performance in diverse voice conversion tasks.

Overview

Figure.1 The Architecture of MAIN-VC

 

Audio Samples in VC Tasks

Audio samples are taken from the VCTK dataset and AISHELL dataset.

Table.1 Traditional Voice Conversion (seen speakers)
Source Target Baseline MAIN-VC
F2F
M2M
F2M
M2F

 

Table.2 One-shot Voice Conversion (unseen speakers)
Source Target Baseline MAIN-VC
F2F
M2M
F2M
M2F

 

Table.3 Cross-lingual One-shot Voice Conversion (Mandarin-to-English, unseen speakers)
Source Target MAIN-VC
M2E F2M
M2E M2F
M2E F2M
M2E M2M

 

Mel-Spectrogram Samples in One-shot VC

Figure.2 VC task "The rainbow is a division of white light into many beautiful colors."

Figure.3 VC task "People look, but no one finds it."

Figure.4 VC task "Some have accepted it as a miracle without physical explanation."

 

Ablation Study

Compare MAIN-VC with:

Table.3 Ablation Study
Source Target w/o CMI w/o SE MAIN-VC