TTS Hallucinations in shorter phrases #1695

adem-rguez · 2025-01-08T07:54:55Z

i am running sherpa-onnx tts in unity (c#). in shorter sentences, the generated audio tends to have extra gibberish added at the end.

example long sentence (works fine) : "Bonjour monsieur, comment allez-vous aujourd’hui ? J’espère que vous passez une excellente journée !"
audio file: long sentence example

example short sentence (adds gibberish at the end): bonjour monsieur
audio file: short sentence example

in these examples i used the upmc voice for french, but the same issue exists on other models.
for example, with the libritts_r model, generating "hello sir" works, but when you generate "hello" immediately after it, it sometimes adds the previous text or part of it ("hello sir" or "hello si").

csukuangfj · 2025-01-08T07:57:01Z

but when you generate "hello" immediately after it

Could you describe in detail how you tried it?

Do you first generate

hello sir

and then you invoke a second call to generate

hello

or

hello sir hello

?

adem-rguez · 2025-01-08T08:04:36Z

in the english example:
first generated:

hello sir

then tried 3 times the text:

hello

i noticed that 2 out of 3 times it adds "sir" or "si" after the "hello" ("hello sir" or "hello si")
but then if i generate a longer phrase "hello how are you?" it doesn't hallucinate!
try this in my apk:
download unity tts apk

meanwhile, in the french example it adds stuff the first time!

csukuangfj · 2025-01-08T08:06:35Z

Can you reproduce it with our APK?
https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

I think there is a bug in your apk if what you described can be reproduced with your APK.

adem-rguez · 2025-01-08T08:14:57Z

you are right, it doesn't happen with your apk. the problem for me isn't just in my apk but also inside unity, using the code i shared earlier in the other thread. i need french models in particular; the stuff they add at the end is not normal, and there are models that don't work at all on short sentences (they generate only distorted audio), like fr-FR_mls_medium.onnx.
for the french example i sent, i used fr-FR-upmc-medium with the espeak-ng-data folder, the model path, and the tokens.txt path. was i missing something?

csukuangfj · 2025-01-08T08:15:08Z

I just tried with your apk and I think there is a bug in your code.

Please make sure you have overwritten the buffer from the previous call.

Don't overwrite the buffer partially.

csukuangfj · 2025-01-08T08:15:54Z

fr-FR_mls_medium.onnx

Please don't use models containing mls in the filename.

I think I have deleted all models containing mls in their names.

csukuangfj · 2025-01-08T08:18:11Z

Or make sure you have cleared the buffer containing samples of the previous call before you play the samples of the current text.
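
To illustrate the buffer advice above, here is a minimal sketch (not taken from the poster's project) of how reusing a fixed-size sample buffer without clearing it can leave the tail of a longer, earlier utterance audible after a shorter one. The names `PlaybackBuffer`, `WritePartial`, and `WriteCleared` are hypothetical and exist only for this example.

```csharp
using System;

public static class PlaybackBuffer
{
    // Copies the new samples to the front of the buffer and leaves
    // everything after them untouched (stale data may remain).
    public static void WritePartial(float[] buffer, float[] samples)
    {
        int n = Math.Min(samples.Length, buffer.Length);
        Array.Copy(samples, buffer, n);
    }

    // Zeros the whole buffer first, so nothing from a previous call survives.
    public static void WriteCleared(float[] buffer, float[] samples)
    {
        Array.Clear(buffer, 0, buffer.Length);
        int n = Math.Min(samples.Length, buffer.Length);
        Array.Copy(samples, buffer, n);
    }
}
```

Allocating a fresh array (or clip) sized exactly to the new samples for every call avoids the problem entirely, since there is no old tail to leave behind.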

adem-rguez · 2025-01-08T08:34:21Z

the buffer is cleared already:

    /// <summary>
    /// 1) Splits the text into sentences using multiple delimiters,
    /// 2) For each sentence, spawns a background thread to generate TTS,
    /// 3) Waits for generation to finish (without freezing the main thread),
    /// 4) Plays the resulting clip in order.
    /// </summary>
    private IEnumerator CoPlayTextBySentenceAsync(string text)
    {
        // More delimiters: period, question mark, exclamation, semicolon, colon
        // We also handle multiple punctuation in a row, etc.
        // This uses Regex to split on punctuation [.!?;:]+ 
        // Then trim the results and remove empties.
        // Split the text while keeping the punctuation with the preceding text
        string[] sentences = Regex.Matches(text, @"[^\.!\?;:]+[\.!\?;:]*")
            .Cast<Match>()
            .Select(m => m.Value.Trim())
            .Where(s => !string.IsNullOrWhiteSpace(s))
            .ToArray();


        if (sentences.Length == 0)
        {
            Debug.LogWarning("No valid sentences found in input text.");
            yield break;
        }

        Debug.Log("sentences #"+ sentences.Length.ToString() );

        foreach (string sentence in sentences)
        {
            Debug.Log("[Background TTS] Generating:"+ sentence );
            
            // Prepare a place to store the generated float[] 
            float[] generatedSamples = null;
            bool generationDone = false;

            // Run .Generate(...) on a background thread
            Thread t = new Thread(() =>
            {
                // Generate the audio for this sentence
                OfflineTtsGeneratedAudio generated = offlineTts.Generate(sentence, speed, speakerId);
                generatedSamples = generated.Samples;
                generationDone = true;
            });
            t.Start();

            // Wait until the thread signals it's done
            yield return new WaitUntil(() => generationDone);

            // Back on the main thread, we create the AudioClip and play it
            if (generatedSamples == null || generatedSamples.Length == 0)
            {
                Debug.LogWarning("Generated empty audio for a sentence. Skipping...");
                continue;
            }

            AudioClip clip = AudioClip.Create(
                "SherpaOnnxTTS-SentenceAsync",
                generatedSamples.Length,
                1,
                offlineTts.SampleRate,
                false
            );
            clip.SetData(generatedSamples, 0);

            sentenceAudioSource.clip = clip;
            sentenceAudioSource.Play();
            Debug.Log($"Playing sentence: \"{sentence}\"  length = {clip.length:F2}s");

            // Wait until playback finishes
            while (sentenceAudioSource.isPlaying)
                yield return null;
        }

        Debug.Log("All sentences have been generated (background) and played sequentially.");
    }

also this is if we are talking about the apk, but in the french version it's different, would you like me to provide an apk for french as well?

csukuangfj · 2025-01-08T08:37:54Z

but in the french version it's different,

Could you describe the differences? Does the APK for French use a different set of code from the APK for English?

adem-rguez · 2025-01-08T08:38:31Z

no, it's the same code, just a different model with a different tokens file. what i mean by different is the issue itself

adem-rguez · 2025-01-08T08:41:18Z

from the first time i generate audio in french, it hallucinates other stuff at the end of the text, so it's not a buffer issue for french. i only mentioned the english apk because i thought it was related

csukuangfj · 2025-01-08T08:57:44Z

I don't see any issues in your posted code.

csukuangfj · 2025-01-08T08:58:20Z

foreach (string sentence in sentences)

Is each sentence processed sequentially, not in parallel?

adem-rguez · 2025-01-08T10:02:51Z

yes, sequentially, since the tts functions don't support streaming right now, it was the only option to make the generation faster

csukuangfj · 2025-01-08T10:26:32Z

it was the only option to make the generation faster

No, we support passing a callback to C++.

Inside C++, it processes the text sentence by sentence. After processing a sentence, the callback is invoked with the generated samples for this sentence.

Please try our Android APK first. You will find it plays almost immediately no matter how long the given text is.

Remember to use the TTS APK, not the TTS Engine APK.

adem-rguez · 2025-01-08T10:34:05Z

in the script i provided in the other thread, there is a function that uses that:

    /// <summary>
    /// Attempted "streaming" approach. The callback is called only once in practice
    /// for the entire waveform, so it doesn't truly stream partial chunks.
    /// </summary>
    private void PlayTextStreamed(string text)
    {
        Debug.Log($"[Streaming] Generating TTS for text: '{text}'");

        int sampleRate = offlineTts.SampleRate;
        int maxAudioLengthInSamples = sampleRate * 300; // 5 min

        streamingClip = AudioClip.Create(
            "SherpaOnnxTTS-Streamed",
            maxAudioLengthInSamples,
            1,
            sampleRate,
            true,
            OnAudioRead,
            OnAudioSetPosition
        );

        if (streamingAudioSource == null)
            streamingAudioSource = gameObject.AddComponent<AudioSource>();

        streamingAudioSource.playOnAwake = false;
        streamingAudioSource.clip = streamingClip;
        streamingAudioSource.loop = false;

        streamingBuffer = new ConcurrentQueue<float>();
        samplesRead = 0;

        streamingAudioSource.Play();

        // This calls your callback, but typically only once for the entire wave
        offlineTts.GenerateWithCallback(text, speed, speakerId, MyTtsChunkCallback);

        Debug.Log("[Streaming] Playback started; awaiting streamed samples...");
    }

    private int MyTtsChunkCallback(System.IntPtr samplesPtr, int numSamples)
    {
        Debug.Log("chunk callback");
        if (numSamples <= 0)
            return 0;

        float[] chunk = new float[numSamples];
        System.Runtime.InteropServices.Marshal.Copy(samplesPtr, chunk, 0, numSamples);

        foreach (float sample in chunk)
            streamingBuffer.Enqueue(sample);

        return 0; 
    }

    private void OnAudioRead(float[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (streamingBuffer.TryDequeue(out float sample))
            {
                data[i] = sample;
                samplesRead++;
            }
            else
            {
                data[i] = 0f; // fill silence
            }
        }
    }

    private void OnAudioSetPosition(int newPosition)
    {
        Debug.Log($"[Streaming] OnAudioSetPosition => {newPosition}");
    }

as you can see, it's implemented with the GenerateWithCallback function, but when i use it the callback is only called once, at the end.

here is an example:

also i don't think it's related to the hallucination problem i mentioned, sadly :(
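
The per-sentence callback flow described earlier in the thread can be sketched in plain C# without the native library. `FakeTts.GenerateWithCallback` and `ChunkCallback` below are hypothetical stand-ins, not the sherpa-onnx API: each "sentence" is turned into a dummy sample array and handed to the callback. The return-value convention used here (nonzero to continue, 0 to stop) is an assumption based on a reading of the C API comments; please check `c-api.h` for the version in use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical callback type: receives one sentence's samples,
// returns 0 to request a stop or nonzero to continue (assumed convention).
public delegate int ChunkCallback(float[] samples);

public static class FakeTts
{
    // Splits the text into sentences and invokes the callback once per
    // sentence. Returns how many sentences were handed to the callback.
    public static int GenerateWithCallback(string text, ChunkCallback callback)
    {
        string[] sentences = text.Split(
            new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);

        int processed = 0;
        foreach (string s in sentences)
        {
            string trimmed = s.Trim();
            if (trimmed.Length == 0)
                continue;

            // Stand-in "audio": one silent sample per character.
            float[] chunk = new float[trimmed.Length];
            processed++;

            if (callback(chunk) == 0)
                break; // the callback asked to stop
        }
        return processed;
    }
}
```

Under this assumed convention, a callback that always returns 0, as `MyTtsChunkCallback` above does, would receive only the first chunk. Whether that explains the single callback seen here is not established in this thread. Separately, `PlayTextStreamed` calls `GenerateWithCallback` synchronously, so the method does not return until generation has finished.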

csukuangfj · 2025-01-08T10:56:54Z

Could you enable debug in the TTS model config and post the logs when you generate samples?

sherpa-onnx/sherpa-onnx/c-api/c-api.h

Line 916 in 0cb2db3

int32_t debug;

adem-rguez · 2025-01-08T11:13:13Z

i don't get any logs, that's the weird part, unity is not showing me any logs except the ones i made! am i doing something wrong?

        // 1. Prepare the VITS model config
        var vitsConfig = new OfflineTtsVitsModelConfig
        {
            Model = BuildPath(modelPath),
            Lexicon = BuildPath(lexiconPath),
            Tokens = BuildPath(tokensPath),
            DataDir = BuildPath(espeakDir),
            DictDir = BuildPath(dictDirPath),

            NoiseScale = noiseScale,
            NoiseScaleW = noiseScaleW,
            LengthScale = lengthScale
        };

        // 2. Wrap it inside the ModelConfig
        var modelConfig = new OfflineTtsModelConfig
        {
            Vits = vitsConfig,
            NumThreads = numThreads,
            Debug = 1,
            Provider = provider
        };

        // 3. Create the top-level OfflineTtsConfig
        var ttsConfig = new OfflineTtsConfig
        {
            Model = modelConfig,
            RuleFsts = "",
            MaxNumSentences = maxNumSentences,
            RuleFars = ""
        };

        // 4. Instantiate the OfflineTts object
        Debug.Log("will create offline tts now!");
        offlineTts = new OfflineTts(ttsConfig);
        Debug.Log($"OfflineTts created! SampleRate: {offlineTts.SampleRate}, NumSpeakers: {offlineTts.NumSpeakers}");

csukuangfj · 2025-01-08T11:37:05Z

IIRC, you posted some error logs in your first issue in the other session. How did you get them?

adem-rguez · 2025-01-08T12:02:55Z

from logcat, with an apk. for some reason unity doesn't show the errors directly. hold tight, i will use logcat again

adem-rguez · 2025-01-08T12:21:59Z

so this is from logcat:

this part isn't supposed to be there:

the raw text is having random stuff added to it..

this example might be easier to understand:

it had a "u" added to it. this was made using the generate function:

adem-rguez · 2025-01-08T12:24:58Z

it's sherpa that's logging that yellow raw text warning, but i am unable to get its stack trace

csukuangfj · 2025-01-09T02:22:22Z

            OfflineTtsGeneratedAudio generated = offlineTts.Generate(sentence, speed, speakerId);

Please show the code for offlineTts.Generate

adem-rguez · 2025-01-09T06:00:20Z

good morning, thank you for your reply!
it's read-only for me, since i am using it straight from the nuget package. in order to modify it, i would need to make a copy of it and use the copy:

#region Assembly sherpa-onnx, Version=1.10.38.0, Culture=neutral, PublicKeyToken=null
// D:\Unity Projects 2\Sherpa-onnx-Unity-main\Assets\Packages\org.k2fsa.sherpa.onnx.1.10.38\lib\netstandard2.0\sherpa-onnx.dll
// Decompiled with ICSharpCode.Decompiler 8.1.1.7464
#endregion

using System;
using System.Runtime.InteropServices;
using System.Text;

namespace SherpaOnnx;

public class OfflineTts : IDisposable
{
    private HandleRef _handle;

    public int SampleRate => SherpaOnnxOfflineTtsSampleRate(_handle.Handle);

    public int NumSpeakers => SherpaOnnxOfflineTtsNumSpeakers(_handle.Handle);

    public OfflineTts(OfflineTtsConfig config)
    {
        IntPtr handle = SherpaOnnxCreateOfflineTts(ref config);
        _handle = new HandleRef(this, handle);
    }

    public OfflineTtsGeneratedAudio Generate(string text, float speed, int speakerId)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return new OfflineTtsGeneratedAudio(SherpaOnnxOfflineTtsGenerate(_handle.Handle, bytes, speakerId, speed));
    }

    public OfflineTtsGeneratedAudio GenerateWithCallback(string text, float speed, int speakerId, OfflineTtsCallback callback)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return new OfflineTtsGeneratedAudio(SherpaOnnxOfflineTtsGenerateWithCallback(_handle.Handle, bytes, speakerId, speed, callback));
    }

    public void Dispose()
    {
        Cleanup();
        GC.SuppressFinalize(this);
    }

    ~OfflineTts()
    {
        Cleanup();
    }

    private void Cleanup()
    {
        SherpaOnnxDestroyOfflineTts(_handle.Handle);
        _handle = new HandleRef(this, IntPtr.Zero);
    }

    [DllImport("sherpa-onnx-c-api")]
    private static extern IntPtr SherpaOnnxCreateOfflineTts(ref OfflineTtsConfig config);

    [DllImport("sherpa-onnx-c-api")]
    private static extern void SherpaOnnxDestroyOfflineTts(IntPtr handle);

    [DllImport("sherpa-onnx-c-api")]
    private static extern int SherpaOnnxOfflineTtsSampleRate(IntPtr handle);

    [DllImport("sherpa-onnx-c-api")]
    private static extern int SherpaOnnxOfflineTtsNumSpeakers(IntPtr handle);

    [DllImport("sherpa-onnx-c-api")]
    private static extern IntPtr SherpaOnnxOfflineTtsGenerate(IntPtr handle, [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I1)] byte[] utf8Text, int sid, float speed);

    [DllImport("sherpa-onnx-c-api", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr SherpaOnnxOfflineTtsGenerateWithCallback(IntPtr handle, [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I1)] byte[] utf8Text, int sid, float speed, OfflineTtsCallback callback);
}
#if false // Decompilation log
'238' items in cache
------------------
Resolve: 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
Found single assembly: 'netstandard, Version=2.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
WARN: Version mismatch. Expected: '2.0.0.0', Got: '2.1.0.0'
Load from: 'D:\Unity Installs\2022.3.55f1\Editor\Data\NetStandard\ref\2.1.0\netstandard.dll'
------------------
Resolve: 'System.Runtime.InteropServices, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null'
Found single assembly: 'System.Runtime.InteropServices, Version=4.1.2.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
WARN: Version mismatch. Expected: '2.0.0.0', Got: '4.1.2.0'
Load from: 'D:\Unity Installs\2022.3.55f1\Editor\Data\NetStandard\compat\2.1.0\shims\netstandard\System.Runtime.InteropServices.dll'
------------------
Resolve: 'System.Runtime.CompilerServices.Unsafe, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null'
Could not find by name: 'System.Runtime.CompilerServices.Unsafe, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null'
------------------
Resolve: 'netstandard, Version=2.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
Found single assembly: 'netstandard, Version=2.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
Load from: 'D:\Unity Installs\2022.3.55f1\Editor\Data\NetStandard\ref\2.1.0\netstandard.dll'
#endif
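
One thing that may be worth checking in the decompiled wrapper above: `Encoding.UTF8.GetBytes(text)` returns the encoded bytes without a trailing zero byte. If the native side treats `utf8Text` as a C string (an assumption; nothing in this thread confirms it), it would keep reading past the end of the array until it happens to hit a zero, which could look like random characters appended to the raw text. A minimal sketch of building a null-terminated buffer, using the hypothetical helper name `Utf8Z.ToNullTerminated`:

```csharp
using System;
using System.Text;

public static class Utf8Z
{
    // Encodes text as UTF-8 and appends a single 0 byte so that native
    // code expecting a C string knows where the text ends.
    public static byte[] ToNullTerminated(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        // New arrays are zero-initialized, so the extra last byte is 0.
        byte[] result = new byte[utf8.Length + 1];
        Array.Copy(utf8, result, utf8.Length);
        return result;
    }
}
```

Because `Generate` lives in the read-only NuGet assembly, trying this would mean declaring your own `DllImport` for the native function or using a package version whose wrapper already appends the terminator; which releases do so is something to verify rather than assume.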

adem-rguez · 2025-01-10T10:25:48Z

hello @csukuangfj, any solution yet?

csukuangfj · 2025-01-10T11:25:22Z

The code looks correct.

Can you reproduce it with our example code in the dotnet-examples folder?

adem-rguez · 2025-01-10T11:58:03Z

sorry but i wasn't able to do that, i kinda lack experience of coding outside of unity
