
[Bug] Fine tuned XTTS v2 produces strange sounds for short text #3516

Open · ukemamaster opened this issue Jan 15, 2024 · 23 comments
Labels: bug (Something isn't working)

Comments

@ukemamaster

Describe the bug

I have fine-tuned the XTTS v2 model on my own data, which contains both long and short audios (see the histogram below: the x-axis is duration in seconds, and the labels 'old' and 'new' denote two datasets with long and short audios, respectively).

[Image: data_es_mix_hist — duration histogram of the training data]

But the model produces strange sounds when the text is only one or two words, as in the following two examples for text='hola':

2.mp4
1.mp4

It seems the model tries to produce at least ~3 seconds of audio even when the text is very short, and so it appends meaningless sounds after the original word.

@erogol Is there any way to avoid this behavior, or any parameter (maybe in the model args) to control it?
There are gpt_start_audio_token and gpt_stop_audio_token parameters in the TTS.tts.models.xtts.XttsArgs class, but I am not sure what impact they have.
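(For reference, a minimal inference sketch showing the generation knobs that influence output length and babble, assuming the Xtts.inference signature in TTS 0.22.0; paths and values are illustrative, not a confirmed fix.)

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load a fine-tuned checkpoint (paths are placeholders).
config = XttsConfig()
config.load_json("run/training/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="run/training/", eval=True)
model.cuda()

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)
out = model.inference(
    "hola",
    "es",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,         # lower temperature reduces random babble
    repetition_penalty=10.0,  # penalizes repeated (often garbled) tokens
    length_penalty=1.0,       # only affects beam search (num_beams > 1)
)
# out["wav"] holds the generated waveform.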

To Reproduce

N/A

Expected behavior

Should produce short audio for short text.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.23.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

@ukemamaster

ukemamaster commented Jan 16, 2024

I tried several times to re-cut the data into clips ranging from 0.5 s to 20 s, guaranteeing alignment with the corresponding text, but nothing improved. There might be a difference between the model args in the training recipe and those of the released, already-trained model.

@erogol Can you please make sure the model args provided in the training recipe are the same as those of your own trained model?
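(A quick, hypothetical way to check this yourself: diff the model_args stored with the released checkpoint against the args set in the fine-tuning recipe. The key names below follow the released XTTS v2 config.json layout; fill in recipe_args from your own recipe.)

import json

# model_args shipped with the released XTTS v2 checkpoint
with open("XTTS-v2/config.json") as f:
    released_args = json.load(f)["model_args"]

# args actually used in the fine-tuning recipe (fill in from your recipe)
recipe_args = {
    "gpt_start_audio_token": 1024,
    "gpt_stop_audio_token": 1025,
    # ...
}

for key, value in recipe_args.items():
    if released_args.get(key) != value:
        print(f"{key}: recipe={value!r} released={released_args.get(key)!r}")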

@bensonbs

Same issue.

@ukemamaster

@bensonbs
Have you fine-tuned the XTTS v2 model on your own dataset?
Can you share a histogram of the audio lengths in your dataset?
Have you tried modifying the training code or model args to avoid this?

@insomnia777

Same issue.

@kaveenkumar

Same issue. The pre-trained XTTS v2 produces extra speech after the intended text 10-20% of the time.

@peterliu2023

Same issue. The pre-trained XTTS v2 generates extra speech randomly.

@bensonbs

bensonbs commented Apr 12, 2024

I have implemented a Diversified Perturbation Optimized (DPO) loss in TTS/tts/layers/xtts/gpt.py to improve the model's generalization and robustness. The aim is to address the strange sounds that occur for short text inputs: by introducing the DPO loss, the model is expected to generate more consistent and natural-sounding audio, even for short text sequences.

Code Snippet:
TTS/tts/layers/xtts/gpt.py

import torch.nn.functional as F

# First forward pass.
text_logits, mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

# Second forward pass over the same inputs. With dropout active, the two
# passes differ, so their disagreement can be penalized.
reject_text_logits, reject_mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

# get_logits() returns logits shaped (batch, vocab, time), so the class
# dimension is dim=1 (a softmax over dim=-1 would normalize over time).
text_probs = F.softmax(text_logits, dim=1)
mel_probs = F.softmax(mel_logits, dim=1)

# Soft-label cross-entropy (PyTorch >= 1.10). Detaching the targets makes
# this one-way distillation from the first pass; without detach it acts as
# a symmetric consistency penalty.
loss_text_dpo = F.cross_entropy(reject_text_logits, text_probs.detach())
loss_mel_dpo = F.cross_entropy(reject_mel_logits, mel_probs.detach())

TTS/tts/layers/xtts/trainer/gpt_trainer.py

        loss_dict["loss_text_ce"] = loss_text * self.args.gpt_loss_text_ce_weight
        loss_dict["loss_mel_ce"] = loss_mel * self.args.gpt_loss_mel_ce_weight
        loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_ce_weight
        loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_ce_weight
        loss_dict["loss"] = loss_dict["loss_text_ce"] + loss_dict["loss_mel_ce"] + loss_dict["loss_text_dpo"] + loss_dict["loss_mel_dpo"]
        
• VRAM usage and training time comparison (both roughly double, as expected, since the loss requires two forward passes):
    • Without DPO loss: VRAM usage X GB; training time per epoch Y minutes
    • With DPO loss: VRAM usage 2X GB; training time per epoch 2Y minutes

@insomnia777

Can you give an explanation, and how can I try it?

@bensonbs

bensonbs commented Apr 15, 2024

> Can you give an explanation, and how can I try it?

When the GPT-2 model generates shorter sentences, it sometimes fails to produce the [STOP] token at the right point, so peculiar sounds end up in the generated content. These sounds are inconsistent because nothing explicitly guides them, meaning each generation may differ.
To address this, during training I compare the outputs of two forward passes produced under the same conditions. Whether both generations contain strange sounds or only one does, the model receives a penalty, which encourages it to avoid generating incoherent random content.

For the method, refer to the modifications in TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py.
I am still testing which loss function is more stable: compared to cross-entropy, MSE eliminates abnormal sounds more reliably, but I am not sure it is theoretically sound.

This method can only be used during fine-tuning, and when using it, make sure your fine-tuning dataset includes enough short audio files.
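(Editorial sketch, not the project's code: a minimal standalone version of this two-pass consistency penalty, including the MSE variant mentioned above. model and inputs are hypothetical stand-ins.)

import torch.nn.functional as F

def two_pass_consistency_loss(model, inputs, use_mse=False):
    """Penalize disagreement between two stochastic forward passes.

    With dropout active, identical inputs yield different logits; pushing
    the two output distributions together discourages unguided random
    content.
    """
    logits_a = model(inputs)  # (batch, classes, time)
    logits_b = model(inputs)  # second pass, different dropout mask

    if use_mse:
        # MSE variant: match the probability tensors directly.
        return F.mse_loss(F.softmax(logits_a, dim=1),
                          F.softmax(logits_b, dim=1))

    # Symmetric KL, as in consistency-training methods such as R-Drop.
    log_p_a = F.log_softmax(logits_a, dim=1)
    log_p_b = F.log_softmax(logits_b, dim=1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)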

@insomnia777

> When the GPT-2 model generates shorter sentences, it sometimes fails to produce the [STOP] token at the right point […] make sure your fine-tuning dataset includes enough short audio files.

Wouldn't it be easier to impose a penalty on the length of the generated sequence, based on median characters-per-second data?
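(A hypothetical sketch of that suggestion: penalize mel-token sequences whose implied duration overshoots what a median characters-per-second prior predicts. The token rate and CPS constants are illustrative assumptions, and since sequence length is not differentiable, this would serve as a sampling-time reward or a data filter rather than a backpropagated loss.)

import torch

def duration_overshoot_penalty(mel_lengths, char_counts,
                               median_cps=15.0,          # assumed median chars/sec
                               mel_tokens_per_sec=21.5,  # assumed GPT mel-token rate
                               slack=1.5):
    """Penalty that grows when generated audio runs much longer than the
    duration implied by the text length. All constants are illustrative."""
    expected_sec = char_counts.float() / median_cps
    actual_sec = mel_lengths.float() / mel_tokens_per_sec
    overshoot = torch.clamp(actual_sec - slack * expected_sec, min=0.0)
    return overshoot.mean()

# Example: a 4-character text ("hola") that produced ~3 s of audio.
penalty = duration_overshoot_penalty(torch.tensor([65]), torch.tensor([4]))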

@tuanh123789

> When the GPT-2 model generates shorter sentences, it sometimes fails to produce the [STOP] token at the right point […] make sure your fine-tuning dataset includes enough short audio files.

Can you share some samples generated with the DPO loss?

@saiful9379

saiful9379 commented Jul 5, 2024

@bensonbs Thank you for the clear explanation. Could you please share some samples after applying DPO, and comment on the audio quality?

@anhnh2002

Same issue.

@tuanh123789

Hi everybody, I found the best way to fix this issue: just fine-tune the DVAE on your data :D


@nvtinh368

> Hi everybody, I found the best way to fix this issue: just fine-tune the DVAE on your data :D

Hello, can you be more specific?

@sushant-samespace

Hello @tuanh123789, do you have any source for fine-tuning the DVAE? Thanks

@kerlynla

> Hi everybody, I found the best way to fix this issue: just fine-tune the DVAE on your data :D

Do you have a fine-tuned Vietnamese model yet?

@anhnh2002

> Hello @tuanh123789, do you have any source for fine-tuning the DVAE? Thanks

https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages
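(A rough sketch of what DVAE fine-tuning involves, following the approach in the repo linked above. The hyperparameters are the XTTS v2 DVAE settings as used there, and mel_loader is a hypothetical DataLoader; verify the constructor arguments and forward return values against your TTS version.)

import torch
from TTS.tts.layers.xtts.dvae import DiscreteVAE

# XTTS v2 DVAE hyperparameters as used in the linked fine-tuning repo.
dvae = DiscreteVAE(
    channels=80, num_tokens=1024, hidden_dim=512, num_resnet_blocks=3,
    codebook_dim=512, num_layers=2, positional_dims=1, kernel_size=3,
    use_transposed_convs=False,
).cuda()
dvae.load_state_dict(torch.load("dvae.pth"), strict=False)
dvae.train()

opt = torch.optim.AdamW(dvae.parameters(), lr=5e-6)
for mel in mel_loader:  # batches of (batch, 80, frames) mel spectrograms
    recon_loss, commitment_loss, _ = dvae(mel.cuda())
    loss = recon_loss.mean() + commitment_loss.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()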


@JohannPie

So we cannot use the pretrained XTTS v2 model as-is? We have to fine-tune our own DVAE?


stale bot commented Dec 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also have a look at our discussion channels.

@eschmidbauer

> I have implemented the Diversified Perturbation Optimized (DPO) loss in TTS/tts/layers/xtts/gpt.py […] With DPO loss: VRAM usage 2X GB; training time per epoch 2Y minutes

I used these settings to train, and I found that avg_loss_text_ce does not seem to improve with them. The light blue line corresponds to the settings mentioned here.

[Image: training curves — avg_loss_text_ce]
