[Bug] Fine tuned XTTS v2 produces strange sounds for short text #3516
I tried several times to re-cut the data into ranges from 0.5 s to 20 s, guaranteeing alignment with the corresponding text, but nothing improved. There might be a difference between the model args in the training recipe and in the already-trained model provided. @erogol Can you please make sure the model args provided in the training recipe are the same as in your own trained model?
Same issue.
@bensonbs
Same issue.
Same issue. The pretrained XTTS v2 produces extra speech after the intended text 10-20% of the time.
Same issue. The pretrained XTTS v2 generates extra speech randomly.
I have implemented a Direct Preference Optimization (DPO) loss. Code snippet:

```python
text_logits, mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)
# Logits for the rejected path (presumably computed from the rejected
# sample's embeddings rather than the same text_emb/mel_emb).
reject_text_logits, reject_mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

text_probs = F.softmax(text_logits, dim=-1)
mel_probs = F.softmax(mel_logits, dim=-1)
loss_text_dpo = F.cross_entropy(reject_text_logits, text_probs)
loss_mel_dpo = F.cross_entropy(reject_mel_logits, mel_probs)

# loss_text / loss_mel are the usual CE losses computed earlier in the method.
loss_dict["loss_text_ce"] = loss_text * self.args.gpt_loss_text_ce_weight
loss_dict["loss_mel_ce"] = loss_mel * self.args.gpt_loss_mel_ce_weight
loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_ce_weight
loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_ce_weight
loss_dict["loss"] = (
    loss_dict["loss_text_ce"]
    + loss_dict["loss_mel_ce"]
    + loss_dict["loss_text_dpo"]
    + loss_dict["loss_mel_dpo"]
)
```
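The DPO terms above use cross-entropy with soft (probability) targets rather than class indices, which recent versions of PyTorch's `F.cross_entropy` accept. A minimal sketch of that computation in plain Python, with made-up logits and targets for illustration only:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy with a probability distribution as target:
    -sum(p * log_softmax(logits))."""
    return -sum(p * ls for p, ls in zip(target_probs, log_softmax(logits)))

# Made-up values: uniform "rejected" logits vs a one-hot "accepted" distribution.
reject_logits = [0.0, 0.0]
accept_probs = [1.0, 0.0]
loss = soft_cross_entropy(reject_logits, accept_probs)
print(round(loss, 4))  # ln(2) ≈ 0.6931
```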
Can you give me an explanation? And how do I try it?
When the GPT-2 model generates shorter sentences, it sometimes fails to produce the [STOP] token accurately, so peculiar sounds end up in the generated audio. These sounds are inconsistent because they are not explicitly guided, meaning each generation might differ. For the method, refer to the modifications above. It can only be used during fine-tuning, and when using it, make sure that your fine-tuning dataset includes enough short audio files.
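To make the failure mode concrete: the GPT decodes audio tokens autoregressively and terminates when it emits a stop token; if that token never comes, decoding runs to a hard limit and the surplus tokens decode to babble. A toy sketch of such a loop (the model is a stand-in function and the token IDs are made up, not XTTS's real values):

```python
STOP_TOKEN = 0   # made-up stop token id
MAX_TOKENS = 10  # hard decoding limit

def decode(next_token_fn):
    """Greedy autoregressive decode: stop on STOP_TOKEN or at MAX_TOKENS."""
    tokens = []
    for _ in range(MAX_TOKENS):
        t = next_token_fn(tokens)
        if t == STOP_TOKEN:
            return tokens, True   # clean stop
        tokens.append(t)
    return tokens, False          # ran to the limit: extra audio is produced

def healthy(toks):
    # Emits three tokens, then stops cleanly.
    return [5, 6, 7, STOP_TOKEN][len(toks)]

def broken(toks):
    # Never emits the stop token.
    return 9

print(decode(healthy))  # ([5, 6, 7], True)
print(decode(broken))   # ([9, 9, ..., 9], False)
```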
Wouldn't it be easier to impose a penalty on the length of the generated sequence, based on median characters-per-second data?
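One way to sketch that idea: derive a hard cap on generated audio tokens from the text length and a measured speaking rate, then stop decoding at the cap even if [STOP] never appears. All constants below are assumptions for illustration, not values from XTTS:

```python
import math

def max_audio_tokens(text, chars_per_sec=14.0, tokens_per_sec=21.5, margin=1.5):
    """Upper bound on audio tokens: expected seconds * token rate * safety margin.

    chars_per_sec: median speaking rate measured on your own data (assumed).
    tokens_per_sec: audio-token rate of the model's codec (assumed).
    margin: slack factor so normal utterances are never truncated.
    """
    expected_seconds = max(len(text) / chars_per_sec, 0.5)  # floor for tiny texts
    return math.ceil(expected_seconds * tokens_per_sec * margin)

print(max_audio_tokens("hola"))       # short text -> small cap
print(max_audio_tokens("hola" * 20))  # longer text -> proportionally larger cap
```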
Can you share some samples with the DPO loss?
@bensonbs Thank you for the clear explanation. Could you please share some samples after applying DPO, and comment on the audio quality?
Same issue.
Hi everybody, I found an effective way to fix this issue: just fine-tune the DVAE with your data :D
Hello, can you be more specific?
Hello @tuanh123789, do you have any resource for fine-tuning the DVAE? Thanks
Do you have a fine-tuned Vietnamese model?
https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages |
So we cannot use the pretrained XTTS v2 model as-is? We have to fine-tune our own DVAE?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check our discussion channels.
Describe the bug
I have fine-tuned the XTTS v2 model on my own data containing both long and short audios (a histogram, not shown here, plots duration in seconds on the x-axis; the labels 'old' and 'new' represent two datasets with long and short audios respectively).
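To reproduce such a duration histogram over your own clips, you can bucket files by length using only the standard library. A minimal sketch (`wav_duration` assumes uncompressed WAV input; the bucket edges are arbitrary):

```python
import wave

def wav_duration(path):
    """Duration of an uncompressed WAV file, in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def bucket_durations(durations, edges=(1.0, 3.0, 5.0, 10.0, 20.0)):
    """Count clips in duration buckets [0, e1), [e1, e2), ..., [en, inf)."""
    counts = [0] * (len(edges) + 1)
    for d in durations:
        counts[sum(1 for e in edges if d >= e)] += 1
    return counts

# Example with made-up durations (seconds):
durations = [0.6, 0.9, 2.5, 4.0, 12.0, 25.0]
print(bucket_durations(durations))  # [2, 1, 1, 0, 1, 1]
```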
But the model produces strange sounds for 1-2 word texts, as in the following two examples for text='hola':

[attached: 2.mp4, 1.mp4]
It seems like the model tries to produce at least 3 seconds of audio even if the text is very short, and thus it adds meaningless sounds around the original word in the text.
@erogol Is there any way to avoid this behavior, or any parameter (maybe in the model args) to control it?
There are gpt_start_audio_token and gpt_stop_audio_token parameters in the TTS.tts.models.xtts.XttsArgs() class, but I am not sure what the impact of these parameters is.

To Reproduce
N/A
Expected behavior
Should produce short audio for short text.
Logs
No response
Environment
Additional context
No response