
adding text chunker in 11labs #1013

Closed

Conversation

Vaibhav159
Contributor

Changelog

  • Added text_chunker utility:
    • Introduced a new utility to handle chunked text processing (a sketch of the pattern follows this list).
    • Integrated text_chunker into ElevenLabsTTSService to support sending text word by word, as recommended in the ElevenLabs documentation.
    • This change is a potential fix for #983 (ElevenLabs exception: sent 1009 (message too big)).
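For reference, a minimal sketch of the word-boundary chunking pattern described in the ElevenLabs input-streaming docs. Names and details here are illustrative, not the PR's exact implementation:

```python
# Sketch of a word-boundary text chunker, modeled on the pattern in the
# ElevenLabs input-streaming docs. Illustrative only; may not match the
# PR's actual implementation.
from typing import AsyncIterator


async def text_chunker(text_stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer incoming text and yield it in word-sized chunks.

    Chunks are flushed at natural split points (punctuation and spaces)
    so that words are never cut in half mid-stream.
    """
    splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")
    buffer = ""
    async for text in text_stream:
        if buffer.endswith(splitters):
            # Buffer ends at a natural boundary: flush it and start fresh.
            yield buffer + " "
            buffer = text
        elif text.startswith(splitters):
            # Incoming text starts at a boundary: flush buffer plus the splitter.
            yield buffer + text[0] + " "
            buffer = text[1:]
        else:
            buffer += text
    if buffer:
        yield buffer + " "
```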

Motivation

The change is inspired by ElevenLabs' recommendation to use input streaming via WebSocket for text-to-speech applications. As per their documentation:

"For applications where the text prompts can be streamed to the text-to-speech endpoints (such as LLM output), this allows for prompts to be fed to the endpoint while the speech is being generated. You can also configure the streaming chunk size when using the WebSocket, with smaller chunks generally rendering faster. As such, we recommend sending content word by word. Our model and tooling leverage context to ensure that sentence structure and more are persisted in the generated audio, even if we only receive a word at a time."

This implementation ensures faster rendering and better alignment with ElevenLabs' best practices, while also addressing potential issues like #983.

@Vaibhav159
Contributor Author

@markbackman can we review this with #983 in mind as well?

@markbackman
Contributor

@Vaibhav159 I don't think the 1009 issue is actually due to the messages being too large. I've tested ElevenLabs with extremely large messages and haven't seen problems; they support TTS generations of up to 40K tokens. This can be tested by pushing a TTSSpeakFrame with a very long text message.
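A rough sketch of that test, assuming an already-constructed pipecat PipelineTask (here named `task`) whose pipeline includes ElevenLabsTTSService; the surrounding setup is omitted:

```python
# Sketch: reproduce the large-message case by queueing one very long
# TTSSpeakFrame. Assumes `task` is a pipecat PipelineTask wired to
# ElevenLabsTTSService; pipeline construction is omitted here.
from pipecat.frames.frames import TTSSpeakFrame


async def stress_test(task):
    # ~90K characters, well beyond a typical sentence-sized message.
    long_text = "All work and no play makes Jack a dull boy. " * 2000
    await task.queue_frame(TTSSpeakFrame(long_text))
```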

Also, when text is processed, there's already logic that splits on sentence boundaries.

I don't think further splitting is the way to go.

Also, we've talked with the 11Labs team about how text should be sent. They recommended enabling auto_mode since we're sending full sentences. This change greatly improves latency. From the docs on auto_mode:

auto_mode
"This parameter focuses on reducing the latency by disabling the chunk schedule and all buffers. It is only recommended when sending full sentences or phrases, sending partial phrases will result in highly reduced quality. By default it's set to false."
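For illustration, auto_mode is passed as a query parameter when opening the stream-input WebSocket. A minimal sketch, with placeholder voice and model IDs (not the actual pipecat wiring):

```python
# Sketch: enabling auto_mode on the ElevenLabs stream-input WebSocket.
# IDs below are placeholders; this is not the pipecat implementation.
import websockets  # third-party: pip install websockets

VOICE_ID = "your-voice-id"    # placeholder
MODEL_ID = "eleven_turbo_v2"  # placeholder streaming-capable model

uri = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
    f"?model_id={MODEL_ID}&auto_mode=true"
)


async def connect():
    # auto_mode=true disables the chunk schedule and buffers, reducing
    # latency; only appropriate when sending full sentences or phrases.
    async with websockets.connect(uri) as ws:
        ...
```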

Given all of this, I'm inclined to say that we shouldn't make this change.

WDYT?

@Vaibhav159
Contributor Author

@markbackman Makes sense. The only case where we might need it would be with auto_mode set to false, but given that we're sending text sentence by sentence, we're better off not complicating the send logic.

Vaibhav159 closed this on Jan 16, 2025
@markbackman
Contributor

Thanks @Vaibhav159. If you are able to repro the 1009 websocket error, I'm very interested to know what that repro case is. I'm hopeful that the receive_task_handler improvements made in #962 will solve the issue.
