Short utterances don't trigger the VAD #984
Comments
Could the approach here resolve this for Deepgram, at least? If the STT is picking up an utterance and finalizing it, but the VAD isn't triggering, then we should just use is_final as a trigger to say something back. In my experience, Deepgram can be a little slower than expected on some of these yes/no responses, but perhaps there's a way to temporarily lower the VAD sensitivity while the latest Deepgram event is an interim transcript, or something similar. Just want to make sure we're using all the tools we have here.
@nikcaryo-super, we've talked about two possible solutions to the problem, one of which is to use the VAD and TranscriptionFrames in combination, as you're suggesting. This is a high-priority issue, and I expect that we'll start working on it soon.
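For anyone who wants to prototype that combination before an official fix lands, here is a minimal sketch. It assumes Pipecat's FrameProcessor API and frame names (import paths and the base-class contract have shifted between releases), and TranscriptionFallbackGate is a made-up name rather than an existing class: when a final transcription arrives even though the VAD never reported speech, it wraps the transcription in a synthetic start/stop pair so downstream aggregation still produces input for the LLM.

# A minimal sketch of combining the VAD with TranscriptionFrames; not an
# official Pipecat processor. Assumes recent import paths.
from pipecat.frames.frames import (
    Frame,
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptionFallbackGate(FrameProcessor):  # hypothetical name
    def __init__(self):
        super().__init__()
        self._vad_active = False

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        # Track what the VAD reports so we only intervene when it stayed silent.
        if isinstance(frame, UserStartedSpeakingFrame):
            self._vad_active = True
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self._vad_active = False
        elif isinstance(frame, TranscriptionFrame) and not self._vad_active:
            # Short utterance transcribed without a VAD trigger: wrap it in a
            # synthetic speaking start/stop so aggregation still happens.
            await self.push_frame(UserStartedSpeakingFrame(), direction)
            await self.push_frame(frame, direction)
            await self.push_frame(UserStoppedSpeakingFrame(), direction)
            return

        await self.push_frame(frame, direction)

Placed between the transport input and the user context aggregator, something like this would leave normal VAD behavior untouched and only kick in for utterances the VAD missed.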
I was going to open a separate issue to suggest that Silero VAD's training might not take into account the frequency filtering of PSTN phone audio. I think I'll just mention it here as part of an overall effort to improve VAD performance, and/or work around it with transcriptions.
I hope this helps as a temporary solution:
async def on_utterance_end(self, stt, *args, **kwargs):
    logger.info(f"Utterance ended with stt: {stt} and args: {args} and kwargs: {kwargs}")
    await self.task.queue_frames([UtteranceEndFrame()])

async def on_speech_started(self, stt, *args, **kwargs):
    # We don't want to interrupt the user when they start speaking so BotInterruptionFrame() is not queued
    # Use SileroVAD for this (we just want to aggregate the TextFrames)
    logger.info("User started speaking from STT")
    await self.task.queue_frames([UserStartedSpeakingFrame()])

and then in async def process_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):

    if isinstance(frame, UtteranceEndFrame) and len(self._aggregation.split()) < 2:
        await self.process_frame(UserStoppedSpeakingFrame(), direction=FrameDirection.DOWNSTREAM)
        # We might want to queue some interruption frames to interrupt the ai
        await self.process_frame(StopInterruptionFrame())

    await super().process_frame(frame, direction)
Hey, I just want to flag that the short-term fix of shortening the VAD's start_secs doesn't seem to be working on my end. I moved it down to 0.01. The transcriber transcribes, but the pipeline still doesn't register that a user has spoken. @balalofernandez that looks interesting... I don't think I'm expert enough to employ it, however.
@balalofernandez I tried this a few weeks back, but the problem is that the speech-started event is not that reliable: even when there is no audio, it would still fire, causing more unexpected interruptions. Let me know if it works well for you on a larger sample set.
You're right, we had to build resiliency around that. What if you had a two-buffer approach, one where you store all the transcripts from the ... We are still testing our solution, though.
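Since the comment above is cut off, here is one rough reading of the two-buffer idea, purely as a sketch (TwoBufferAggregator is a hypothetical helper, not the commenter's code or a Pipecat class): transcripts that arrive while the VAD confirms speech are trusted immediately, while transcripts that arrive without VAD confirmation wait in a pending buffer until a VAD stop or the STT's utterance-end event promotes them, which is one way to build resiliency against spurious speech-started events.

class TwoBufferAggregator:  # hypothetical helper, not the actual solution
    def __init__(self):
        self.confirmed: list[str] = []
        self.pending: list[str] = []
        self.vad_speaking = False

    def on_vad_started(self):
        self.vad_speaking = True
        # Anything that arrived just before the VAD caught up is now trusted.
        self.confirmed.extend(self.pending)
        self.pending.clear()

    def on_transcript(self, text: str):
        (self.confirmed if self.vad_speaking else self.pending).append(text)

    def flush(self) -> str:
        # Called on a VAD stop or on the STT's utterance-end event.
        self.vad_speaking = False
        self.confirmed.extend(self.pending)
        self.pending.clear()
        text = " ".join(self.confirmed)
        self.confirmed.clear()
        return text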
I've been speaking with @markbackman over Discord on this issue. I have a solution I'm working on, but unfortunately it only works for Twilio. Twilio sends us "mark" events when audio has finished playing, and my implementation uses these to figure out when the bot has stopped speaking. For how I'm using Pipecat at the moment, I don't have interruptions turned on, so I only take user transcriptions when the bot is not talking. The VAD delay that was causing delayed BotStoppedTalking events was what was causing us to miss first words, as the event was 0.8s behind. As also mentioned on Discord, I'm not sure VAD is a foolproof solution, because there is always a delay: it has to see sufficient silence to know the bot has stopped speaking. Additionally, it feels a bit wasteful on system resources, even if minimal, to run VAD on audio that we generate. We know the durations of the audio we are playing, and we also know when we have stopped sending audio to play. If we could calculate the latency between sending audio and it actually being played, then we could calculate when the bot has stopped speaking.
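For reference, here is a sketch of that Twilio-only approach. It relies on the Twilio Media Streams websocket protocol, where you can send a named "mark" message after outbound audio and Twilio echoes it back once playback reaches that point in the stream; the class name and the websocket plumbing are illustrative, not part of Pipecat.

import json
import uuid

class TwilioMarkTracker:  # hypothetical helper
    def __init__(self, websocket, stream_sid: str):
        self._ws = websocket
        self._stream_sid = stream_sid
        self._outstanding: set[str] = set()

    async def audio_sent(self):
        # Called after each outbound media message: ask Twilio to echo a mark
        # back when playback reaches this point in the stream.
        name = str(uuid.uuid4())
        self._outstanding.add(name)
        await self._ws.send(json.dumps({
            "event": "mark",
            "streamSid": self._stream_sid,
            "mark": {"name": name},
        }))

    def on_twilio_message(self, message: dict) -> bool:
        # Returns True once the last outstanding mark has been played, i.e. the
        # bot has actually finished speaking on the caller's end.
        if message.get("event") == "mark":
            self._outstanding.discard(message["mark"]["name"])
        return not self._outstanding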
Description
We've received a number of reports that short utterances like "OK", "Yes", and "No" aren't "heard" by the bot. The root cause is that the VAD is not triggered: the default start_secs setting only detects speech with a duration of 0.2 seconds or longer.

As a quick fix, the VAD's start_secs value can be lowered to 0.15 or even 0.1 seconds, but this may have unintended consequences, such as interruptions triggering unexpectedly.

As a resolution to this issue, we want to experiment with different solutions to ensure that interruptions happen only with actual speech and that short utterances are transcribed and result in input to the LLM.
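For completeness, a minimal sketch of that quick fix, assuming a recent Pipecat layout for the Silero VAD (these import paths have moved between releases, so adjust them to your installed version):

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    # Default start_secs is 0.2; lowering it lets short "OK"/"Yes"/"No"
    # utterances trigger the VAD, at the cost of more spurious interruptions.
    params=VADParams(start_secs=0.1)
)

# The analyzer is then passed to the transport's params, e.g.
# DailyParams(vad_enabled=True, vad_analyzer=vad_analyzer, ...)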