Increasing SVC inference speed #193

Open
KaikeWesleyReis opened this issue May 31, 2024 · 3 comments

Comments

@KaikeWesleyReis

Hi,
I'm developing a personal project: a conversational chatbot. The idea is quite simple: have a chat with Harbinger, the first Reaper (from the Mass Effect series). I found an optimal solution for generating his voice from text: use vits-ljspeech-base from Coqui TTS (without any fine-tuning) to generate the audio, then use your fine-tuned SVC model to apply the voice to the generated audio. For example, given this sentence:

Organic intellect, fascinated by the patterns of the universe. I, Harbinger, have witnessed the harmony of numbers governing the cosmos. The intricate dance of primes, the elegance of elliptic curves, and the recursion of Fibonacci's sequence all resonate with my being. Which aspect of number theory would you like to dissect, researcher?

These are the timings I measured for each step:
[image: time]
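
Per-step timings like these can be collected with a small harness. The following is a minimal sketch, not code from this repository: `synthesize_tts` and `convert_voice` are hypothetical placeholders standing in for the real Coqui TTS and SVC calls, and only the `timed` helper is the point of the example.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline step.
timings = {}


@contextmanager
def timed(step):
    """Add the elapsed time of the enclosed block to timings[step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + time.perf_counter() - start


def synthesize_tts(text):
    # Hypothetical placeholder for the TTS stage: returns fake samples.
    return [0.0] * len(text)


def convert_voice(samples):
    # Hypothetical placeholder for the SVC stage: returns a copy.
    return list(samples)


def run_pipeline(text):
    with timed("tts"):
        audio = synthesize_tts(text)
    with timed("svc"):
        converted = convert_voice(audio)
    return converted
```

After calling `run_pipeline(...)`, the `timings` dictionary holds one entry per step, which makes it easier to see which stage dominates before trying to optimize it.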

Now I'm studying the inference code of your model and so far I have the following ideas:

  • Whisper tiny instead of large

Is it possible to skip or pre-generate any of the feature vectors (Whisper, HuBERT, pitch and so on) to reduce the inference time of the other models, and thus the overall SVC inference time?
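
One generic way to "pre-generate" vectors is to cache extracted features by a hash of the input audio. This is only a sketch under assumptions: `fake_content_features` is a hypothetical stand-in for a real extractor (Whisper, HuBERT, pitch), and `FeatureCache` is not part of this repository. It helps only when the same source audio is converted more than once; freshly generated TTS audio still needs one extraction pass.

```python
import hashlib


class FeatureCache:
    """Memoize an expensive feature extractor by the SHA-256 of its input."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._store = {}
        self.misses = 0  # number of times the extractor actually ran

    def get(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extractor(audio_bytes)
        return self._store[key]


def fake_content_features(audio_bytes):
    # Hypothetical extractor: scales each byte to the range [0, 1].
    return [b / 255 for b in audio_bytes]


cache = FeatureCache(fake_content_features)
```

Calling `cache.get(...)` twice with the same bytes runs the extractor once and returns the stored result the second time.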

By the way, thanks for your repository: it is the easiest "prepare your data and run" project I have found so far in the deep learning field.

Cheers from Brazil,

@ShadowLoveElysia

Hey, I understand your thinking, and what you're doing is totally fine. But I have to give you a reality check. Yes, SVC can be used for audio replacement, but it seems like you're over-engineering it. You could just use TTS projects like GPT-Sovits instead of converting an existing TTS.

@ShadowLoveElysia

If you want to do voice conversion, then this project is definitely fine, but if your goal is just TTS, then GPT-Sovits is sufficient.

@KaikeWesleyReis

KaikeWesleyReis commented Jun 2, 2024

@ShadowLoveElysia

> If you want to do voice conversion, then this project is definitely fine, but if your goal is just TTS, then GPT-Sovits is sufficient.

Does GPT-Sovits follow the same idea as the VITS fine-tuning that I have already done? Don't you think I'll run into the same mistakes I made with VITS fine-tuning?

The voice I am working with is this: https://www.youtube.com/watch?v=YZt6NKrkdzQ&

Given the nature of this voice, do you believe it is possible to fine-tune GPT-Sovits on it?
