TTS Hallucinations in shorter phrases #1695

Open
adem-rguez opened this issue Jan 8, 2025 · 27 comments

Comments

@adem-rguez

adem-rguez commented Jan 8, 2025

I am running sherpa-onnx TTS in Unity (C#). With shorter sentences, the generated audio tends to have extra gibberish appended at the end.

Example long sentence (works fine): "Bonjour monsieur, comment allez-vous aujourd’hui ? J’espère que vous passez une excellente journée !"
audio file: long sentence example

Example short sentence (gibberish added at the end): "bonjour monsieur"
audio file: short sentence example

In these examples I used the upmc voice for French, but the same issue exists with other models.
For example, with the libritts_r model, generating "hello sir" works, but when you generate "hello" immediately after it, the output sometimes includes the previous text or part of it ("hello sir" or "hello si").

@csukuangfj
Collaborator

> but when you generate "hello" immediately after it

Could you describe in detail how you tried it?

Do you first generate

hello sir

and then you invoke a second call to generate

hello

or

hello sir hello

?

@adem-rguez
Author

adem-rguez commented Jan 8, 2025

In the English example, I first generated:

hello sir

then tried the following text 3 times:

hello

I noticed that 2 out of 3 times it adds "sir" or "si" after the "hello" ("hello sir" or "hello si"),
but if I then generate a longer phrase such as "hello how are you?", it doesn't hallucinate!
Try this in my APK:
download unity tts apk

Meanwhile, in the French example it adds extra audio on the very first generation!

@csukuangfj
Collaborator

Can you reproduce it with our APK?
https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

If what you described can be reproduced only with your APK and not with ours, I think there is a bug in your APK.

@adem-rguez
Author

You are right, it doesn't happen with your APK. For me the problem isn't just in the APK but also inside Unity, using the code I shared earlier in the other thread. I need French models in particular; what they add at the end is not normal, and some models don't work at all on short sentences (they generate only distorted audio), like fr-FR_mls_medium.onnx.
For the French example I sent, I used fr-FR-upmc-medium with the espeak-ng-data folder, the model path, and the tokens.txt path. Was I missing something?

@csukuangfj
Collaborator

I just tried your APK and I think there is a bug in your code.

Please make sure you have overwritten the buffer from the previous call.

Don't overwrite the buffer only partially.

@csukuangfj
Collaborator

> fr-FR_mls_medium.onnx

Please don't use models containing mls in the filename.

I think I have deleted all models containing mls in their names.

@csukuangfj
Collaborator

Or make sure you have cleared the buffer containing samples of the previous call before you play the samples of the current text.
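
A minimal sketch of two ways to follow this advice, assuming the samples end up in a buffer that is reused between calls. The names `SampleBufferDemo`, `CopyExact`, and `OverwriteAndClearTail` are hypothetical and not part of sherpa-onnx or Unity. If a short utterance is written over a longer one without clearing the rest, the tail of the earlier audio is still there when the buffer is played.

```csharp
using System;

public static class SampleBufferDemo
{
    // Option 1: copy the valid samples into a new array of exactly that size,
    // so nothing from an earlier utterance can follow them.
    public static float[] CopyExact(float[] source, int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (count < 0 || count > source.Length)
            throw new ArgumentOutOfRangeException(nameof(count));

        var fresh = new float[count];
        Array.Copy(source, fresh, count);
        return fresh;
    }

    // Option 2: keep reusing one buffer, but zero everything after the new
    // samples so the remainder plays back as silence.
    public static void OverwriteAndClearTail(float[] buffer, float[] samples)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));
        if (samples == null) throw new ArgumentNullException(nameof(samples));
        if (samples.Length > buffer.Length)
            throw new ArgumentException("samples do not fit into buffer");

        Array.Copy(samples, buffer, samples.Length);
        Array.Clear(buffer, samples.Length, buffer.Length - samples.Length);
    }
}
```

Allocating an exactly sized array per utterance is the simpler choice; clearing the tail only matters when one long-lived buffer or clip is shared across utterances.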

@adem-rguez
Author

the buffer is cleared already:

    /// <summary>
    /// 1) Splits the text into sentences using multiple delimiters,
    /// 2) For each sentence, spawns a background thread to generate TTS,
    /// 3) Waits for generation to finish (without freezing the main thread),
    /// 4) Plays the resulting clip in order.
    /// </summary>
    private IEnumerator CoPlayTextBySentenceAsync(string text)
    {
        // More delimiters: period, question mark, exclamation, semicolon, colon
        // We also handle multiple punctuation in a row, etc.
        // This uses Regex to split on punctuation [.!?;:]+ 
        // Then trim the results and remove empties.
        // Split the text while keeping the punctuation with the preceding text
        string[] sentences = Regex.Matches(text, @"[^\.!\?;:]+[\.!\?;:]*")
            .Cast<Match>()
            .Select(m => m.Value.Trim())
            .Where(s => !string.IsNullOrWhiteSpace(s))
            .ToArray();


        if (sentences.Length == 0)
        {
            Debug.LogWarning("No valid sentences found in input text.");
            yield break;
        }

        Debug.Log("sentences #"+ sentences.Length.ToString() );

        foreach (string sentence in sentences)
        {
            Debug.Log("[Background TTS] Generating:"+ sentence );
            
            // Prepare a place to store the generated float[] 
            float[] generatedSamples = null;
            bool generationDone = false;

            // Run .Generate(...) on a background thread
            Thread t = new Thread(() =>
            {
                // Generate the audio for this sentence
                OfflineTtsGeneratedAudio generated = offlineTts.Generate(sentence, speed, speakerId);
                generatedSamples = generated.Samples;
                generationDone = true;
            });
            t.Start();

            // Wait until the thread signals it's done
            yield return new WaitUntil(() => generationDone);

            // Back on the main thread, we create the AudioClip and play it
            if (generatedSamples == null || generatedSamples.Length == 0)
            {
                Debug.LogWarning("Generated empty audio for a sentence. Skipping...");
                continue;
            }

            AudioClip clip = AudioClip.Create(
                "SherpaOnnxTTS-SentenceAsync",
                generatedSamples.Length,
                1,
                offlineTts.SampleRate,
                false
            );
            clip.SetData(generatedSamples, 0);

            sentenceAudioSource.clip = clip;
            sentenceAudioSource.Play();
            Debug.Log($"Playing sentence: \"{sentence}\"  length = {clip.length:F2}s");

            // Wait until playback finishes
            while (sentenceAudioSource.isPlaying)
                yield return null;
        }

        Debug.Log("All sentences have been generated (background) and played sequentially.");
    }

Also, that is for the APK case, but in the French version it's different. Would you like me to provide an APK for French as well?

@csukuangfj
Collaborator

> but in the French version it's different,

Could you describe the differences? Does the APK for French use a different set of code from the APK for English?

@adem-rguez
Author

adem-rguez commented Jan 8, 2025

No, it's the same code, just a different model with a different tokens file. What I mean by "different" is the issue itself.

@adem-rguez
Author

From the very first time I generate audio in French, it hallucinates extra content at the end of the text, so it's not a buffer issue for French. I only mentioned the English APK because I thought it was related.

@csukuangfj
Collaborator

I don't see any issues from your posted code.

@csukuangfj
Collaborator

> foreach (string sentence in sentences)

Is each sentence processed sequentially, not in parallel?

@adem-rguez
Author

Yes, sequentially. Since the TTS functions don't support streaming right now, it was the only option to make the generation faster.

@csukuangfj
Collaborator

> it was the only option to make the generation faster

No, we support passing a callback to C++.

Inside C++, it processes the text sentence by sentence. After processing a sentence, the callback is invoked with the generated samples for this sentence.

Please try our Android APK first. You will find it plays almost immediately no matter how long the given text is.

Remember to use the TTS APK, not the TTS Engine APK.

@adem-rguez
Author

adem-rguez commented Jan 8, 2025

In the script I provided in the other thread, there is a function that uses that:

    /// <summary>
    /// Attempted "streaming" approach. The callback is called only once in practice
    /// for the entire waveform, so it doesn't truly stream partial chunks.
    /// </summary>
    private void PlayTextStreamed(string text)
    {
        Debug.Log($"[Streaming] Generating TTS for text: '{text}'");

        int sampleRate = offlineTts.SampleRate;
        int maxAudioLengthInSamples = sampleRate * 300; // 5 min

        streamingClip = AudioClip.Create(
            "SherpaOnnxTTS-Streamed",
            maxAudioLengthInSamples,
            1,
            sampleRate,
            true,
            OnAudioRead,
            OnAudioSetPosition
        );

        if (streamingAudioSource == null)
            streamingAudioSource = gameObject.AddComponent<AudioSource>();

        streamingAudioSource.playOnAwake = false;
        streamingAudioSource.clip = streamingClip;
        streamingAudioSource.loop = false;

        streamingBuffer = new ConcurrentQueue<float>();
        samplesRead = 0;

        streamingAudioSource.Play();

        // This calls your callback, but typically only once for the entire wave
        offlineTts.GenerateWithCallback(text, speed, speakerId, MyTtsChunkCallback);

        Debug.Log("[Streaming] Playback started; awaiting streamed samples...");
    }

    private int MyTtsChunkCallback(System.IntPtr samplesPtr, int numSamples)
    {
        Debug.Log("chunk callback");
        if (numSamples <= 0)
            return 0;

        float[] chunk = new float[numSamples];
        System.Runtime.InteropServices.Marshal.Copy(samplesPtr, chunk, 0, numSamples);

        foreach (float sample in chunk)
            streamingBuffer.Enqueue(sample);

        return 0; 
    }

    private void OnAudioRead(float[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (streamingBuffer.TryDequeue(out float sample))
            {
                data[i] = sample;
                samplesRead++;
            }
            else
            {
                data[i] = 0f; // fill silence
            }
        }
    }

    private void OnAudioSetPosition(int newPosition)
    {
        Debug.Log($"[Streaming] OnAudioSetPosition => {newPosition}");
    }

As you can see, it's implemented with the GenerateWithCallback function, but when I use it the callback is only called once, at the end.

Here is an example:
image
Also, I don't think it's related to the hallucination problem I mentioned, sadly :(
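
One detail in the snippet above that may be worth ruling out: `MyTtsChunkCallback` returns 0. As far as I understand the sherpa-onnx C API, the callback's return value tells the generator whether to keep going (1) or stop (0), and the callback fires once per batch of at most `MaxNumSentences` sentences. Both points are assumptions to check against `c-api.h` and the dotnet-examples folder for the version in use. The sketch below only simulates that assumed contract with a hypothetical `FakeGenerate`; it does not call sherpa-onnx.

```csharp
using System;
using System.Collections.Generic;

public static class CallbackContractDemo
{
    // Same shape as the callback in the post: (pointer to samples, count) -> int.
    // The simulation never dereferences the pointer and passes IntPtr.Zero.
    public delegate int ChunkCallback(IntPtr samples, int numSamples);

    // Hypothetical stand-in for a generator that emits one chunk per sentence
    // and stops as soon as the callback returns 0 (the assumed contract).
    // Returns how many times the callback was invoked.
    public static int FakeGenerate(IReadOnlyList<string> sentences, ChunkCallback callback)
    {
        if (sentences == null) throw new ArgumentNullException(nameof(sentences));
        if (callback == null) throw new ArgumentNullException(nameof(callback));

        int calls = 0;
        foreach (string sentence in sentences)
        {
            calls++;
            // The sentence length stands in for the number of samples.
            int keepGoing = callback(IntPtr.Zero, sentence.Length);
            if (keepGoing == 0)
                break;
        }
        return calls;
    }
}
```

Under this assumed contract, a callback that returns 0 is invoked once for three sentences, while one that returns 1 is invoked three times. If the real API behaves the same way, returning 1 from the callback and passing text with several sentences would be a quick experiment.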

@csukuangfj
Collaborator

Could you enable debug in the TTS model config and post the logs when you generate samples?

int32_t debug;

@adem-rguez
Author

adem-rguez commented Jan 8, 2025

I don't get any logs, that's the weird part. Unity is not showing me any logs except the ones I added! Am I doing something wrong?

        // 1. Prepare the VITS model config
        var vitsConfig = new OfflineTtsVitsModelConfig
        {
            Model = BuildPath(modelPath),
            Lexicon = BuildPath(lexiconPath),
            Tokens = BuildPath(tokensPath),
            DataDir = BuildPath(espeakDir),
            DictDir = BuildPath(dictDirPath),

            NoiseScale = noiseScale,
            NoiseScaleW = noiseScaleW,
            LengthScale = lengthScale
        };

        // 2. Wrap it inside the ModelConfig
        var modelConfig = new OfflineTtsModelConfig
        {
            Vits = vitsConfig,
            NumThreads = numThreads,
            Debug = 1,
            Provider = provider
        };

        // 3. Create the top-level OfflineTtsConfig
        var ttsConfig = new OfflineTtsConfig
        {
            Model = modelConfig,
            RuleFsts = "",
            MaxNumSentences = maxNumSentences,
            RuleFars = ""
        };

        // 4. Instantiate the OfflineTts object
        Debug.Log("will create offline tts now!");
        offlineTts = new OfflineTts(ttsConfig);
        Debug.Log($"OfflineTts created! SampleRate: {offlineTts.SampleRate}, NumSpeakers: {offlineTts.NumSpeakers}");

@csukuangfj
Collaborator

IIRC, you posted some error logs in your first issue in the other session. How did you get them?

@adem-rguez
Author

From logcat; that was from an APK. For some reason Unity doesn't show the errors directly. Hold tight, I will use logcat again.

@adem-rguez
Author

So this is from logcat:
image
This part isn't supposed to be there:
image
The raw text is getting random characters added to it.

This example might be easier to understand:
image

It had a "u" added to it. This was made using the Generate function:
image

@adem-rguez
Author

It's sherpa that's logging that yellow raw text warning, but I am unable to get its stack trace.

@csukuangfj
Collaborator

            OfflineTtsGeneratedAudio generated = offlineTts.Generate(sentence, speed, speakerId);

Please show the code for offlineTts.Generate

@adem-rguez
Author

adem-rguez commented Jan 9, 2025

Good morning, thank you for your reply!
It's read-only for me because I am using it straight from the NuGet package. To modify it, I would need to make a copy and use the copy:

#region Assembly sherpa-onnx, Version=1.10.38.0, Culture=neutral, PublicKeyToken=null
// D:\Unity Projects 2\Sherpa-onnx-Unity-main\Assets\Packages\org.k2fsa.sherpa.onnx.1.10.38\lib\netstandard2.0\sherpa-onnx.dll
// Decompiled with ICSharpCode.Decompiler 8.1.1.7464
#endregion

using System;
using System.Runtime.InteropServices;
using System.Text;

namespace SherpaOnnx;

public class OfflineTts : IDisposable
{
    private HandleRef _handle;

    public int SampleRate => SherpaOnnxOfflineTtsSampleRate(_handle.Handle);

    public int NumSpeakers => SherpaOnnxOfflineTtsNumSpeakers(_handle.Handle);

    public OfflineTts(OfflineTtsConfig config)
    {
        IntPtr handle = SherpaOnnxCreateOfflineTts(ref config);
        _handle = new HandleRef(this, handle);
    }

    public OfflineTtsGeneratedAudio Generate(string text, float speed, int speakerId)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return new OfflineTtsGeneratedAudio(SherpaOnnxOfflineTtsGenerate(_handle.Handle, bytes, speakerId, speed));
    }

    public OfflineTtsGeneratedAudio GenerateWithCallback(string text, float speed, int speakerId, OfflineTtsCallback callback)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return new OfflineTtsGeneratedAudio(SherpaOnnxOfflineTtsGenerateWithCallback(_handle.Handle, bytes, speakerId, speed, callback));
    }

    public void Dispose()
    {
        Cleanup();
        GC.SuppressFinalize(this);
    }

    ~OfflineTts()
    {
        Cleanup();
    }

    private void Cleanup()
    {
        SherpaOnnxDestroyOfflineTts(_handle.Handle);
        _handle = new HandleRef(this, IntPtr.Zero);
    }

    [DllImport("sherpa-onnx-c-api")]
    private static extern IntPtr SherpaOnnxCreateOfflineTts(ref OfflineTtsConfig config);

    [DllImport("sherpa-onnx-c-api")]
    private static extern void SherpaOnnxDestroyOfflineTts(IntPtr handle);

    [DllImport("sherpa-onnx-c-api")]
    private static extern int SherpaOnnxOfflineTtsSampleRate(IntPtr handle);

    [DllImport("sherpa-onnx-c-api")]
    private static extern int SherpaOnnxOfflineTtsNumSpeakers(IntPtr handle);

    [DllImport("sherpa-onnx-c-api")]
    private static extern IntPtr SherpaOnnxOfflineTtsGenerate(IntPtr handle, [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I1)] byte[] utf8Text, int sid, float speed);

    [DllImport("sherpa-onnx-c-api", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr SherpaOnnxOfflineTtsGenerateWithCallback(IntPtr handle, [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I1)] byte[] utf8Text, int sid, float speed, OfflineTtsCallback callback);
}
#if false // Decompilation log
'238' items in cache
------------------
Resolve: 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
Found single assembly: 'netstandard, Version=2.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
WARN: Version mismatch. Expected: '2.0.0.0', Got: '2.1.0.0'
Load from: 'D:\Unity Installs\2022.3.55f1\Editor\Data\NetStandard\ref\2.1.0\netstandard.dll'
------------------
Resolve: 'System.Runtime.InteropServices, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null'
Found single assembly: 'System.Runtime.InteropServices, Version=4.1.2.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
WARN: Version mismatch. Expected: '2.0.0.0', Got: '4.1.2.0'
Load from: 'D:\Unity Installs\2022.3.55f1\Editor\Data\NetStandard\compat\2.1.0\shims\netstandard\System.Runtime.InteropServices.dll'
------------------
Resolve: 'System.Runtime.CompilerServices.Unsafe, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null'
Could not find by name: 'System.Runtime.CompilerServices.Unsafe, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null'
------------------
Resolve: 'netstandard, Version=2.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
Found single assembly: 'netstandard, Version=2.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
Load from: 'D:\Unity Installs\2022.3.55f1\Editor\Data\NetStandard\ref\2.1.0\netstandard.dll'
#endif
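
One thing that stands out in the decompiled wrapper: `Generate` passes the result of `Encoding.UTF8.GetBytes(text)` straight to the native function, and that array has no trailing zero byte. If the native side treats the parameter as a NUL-terminated C string (an assumption, since the C signature is not shown in this thread), it would keep reading past the end of the array until it happens to hit a zero. That would fit the symptoms: short inputs picking up leftovers such as "sir", and the raw text in logcat showing extra characters. A minimal sketch of a helper that adds the terminator follows; `Utf8Interop` and `ToNullTerminatedUtf8` are hypothetical names, not part of the sherpa-onnx package.

```csharp
using System;
using System.Text;

public static class Utf8Interop
{
    // Encodes text as UTF-8 and leaves one extra zero byte at the end,
    // which is what C code taking a `const char *` looks for.
    public static byte[] ToNullTerminatedUtf8(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        int byteCount = Encoding.UTF8.GetByteCount(text);

        // New arrays are zero-initialized, so the final byte is already 0.
        byte[] buffer = new byte[byteCount + 1];
        Encoding.UTF8.GetBytes(text, 0, text.Length, buffer, 0);
        return buffer;
    }
}
```

Because the NuGet assembly is read-only, trying this would need a copy of the wrapper, or a separate P/Invoke declaration, that passes the terminated array instead. Whether this is the actual cause here has not been confirmed in the thread.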

@adem-rguez
Author

Hello @csukuangfj, any solution yet?

@csukuangfj
Collaborator

The code looks correct.

Can you reproduce it with our example code in the dotnet-examples folder?

@adem-rguez
Author

Sorry, but I wasn't able to do that; I lack experience coding outside of Unity.
