Yes, we can use Coqui with RVC!
Coqui is a text-to-speech framework (vocoder and encoder), but cloning your own voice takes decades and offers no guarantee of better results. That's why we use RVC (Retrieval-Based Voice Conversion), which works only for speech-to-speech. You can train the model with just 2-3 minutes of dataset as it uses Hubert (a pre-trained model to fine-tune quickly and provide better results).
How to use Coqui + RVC api?
IMPORTANT: Make sure you are using Python 3.10 or lower.
-
Clone the repository
git clone https://github.com/adrien-dimitri/tts-rvc
-
(Recommended) Create and Activate a Virtual Environment:
-
On Windows:
python -m venv .venv .venv\Scripts\activate
-
On macOS and Linux:
virtualenv <venv_name> source <venv_name>/bin/activate
-
-
Install the required packages:
pip install -r requirements.txt pip install TTS
-
Install the espeak backend for TTS:
Follow this guide on how to install the espeak backend for TTS: espeak backend Installation Guide
More information can be found here: espeak-ng
-
Install FFmpeg CLI:
-
On Windows, follow this guide: FFmpeg Installation Guide
-
On Linux, simply run the following command:
sudo apt-get install ffmpeg
More information can be found here: FFmpeg
-
-
Restart your PC.
-
Download the hubert model from one of the following links and place it in the
models/hubert
folder:hubert_base.pt - Google Drive (recommended)
-
Download a pretrained RVC model from the following link and place it in a folder in the
models/
folder:You will get the
rvc_model.pth
andrvc_model.index
files which you need to place in themodels/rvc_model_name
folder.The folder structure should look like this:
/ └── models └── rvc_model_name ├── rvc_model.pth └── rvc_model.index
-
Run the server:
python -m uvicorn app.main:app
-
Run the
api.py
file:python api.py
http://localhost:8000/generate
emotions : happy,sad,angry,dull
speed = 1.0 - 2.0
{
"speaker_name": "speaker3",
"input_text": "Hey there! Welcome to the world",
"emotion": "Surprise",
"speed": 1.0
}
import requests
import json
import time
url = "http://127.0.0.1:8000/generate"
payload = json.dumps({
"speaker_name": "speaker3",
"input_text": "Are you mad? The way you've betrayed me is beyond comprehension, a slap in the face that's left me boiling with an anger so intense it's as if you've thrown gasoline on a fire, utterly destroying any trust that was left.",
"emotion": "Dull",
"speed": 1.0
})
headers = {
'Content-Type': 'application/json'
}
start_time = time.time() # Start the timer
response = requests.request("POST", url, headers=headers, data=payload)
end_time = time.time() # Stop the timer
if response.status_code == 200:
audio_content = response.content
# Save the audio to a file
with open("generated_audio.wav", "wb") as audio_file:
audio_file.write(audio_content)
print("Audio saved successfully.")
print("Time taken:", end_time - start_time, "seconds")
else:
print("Error:", response.text)
To adjust the parameters of the TTS model, you need to locate the tts.py
file which can be found at app/routers/tts.py
in line 22.
Simply change the model_name
parameter to the desired model name.
model_name | language | gender |
---|---|---|
tts_models/de/thorsten/tacotron2-DDC | de | male |
tts_models/de/css10/vits-neon | de | female |
tts_models/en/ljspeech/vits | en | female |
tts_models/en/sam/tacotron-DDC | en | male |
To activate CUDA, change the gpu
parameter to True
.
The file should look like this:
# Initialize TTS
tts = TTS(model_name="tts_models/en/ljspeech/vits", progress_bar=True, gpu=False)
Change it to:
# Initialize TTS
tts = TTS(model_name="tts_models/en/ljspeech/vits", progress_bar=True, gpu=True)