Short utterances don't trigger the VAD #984

Open · markbackman opened this issue Jan 14, 2025 · 8 comments

@markbackman (Contributor)

Description

We've received a number of reports that short utterances like "OK", "Yes", and "No" aren't "heard" by the bot. The root cause is that the VAD is not triggered: by default, speech must last at least start_secs (0.2 seconds) for the VAD to fire.

As a quick fix, the VAD's start_secs value can be lowered to 0.15 or even 0.1 seconds. But this may have unintended consequences, like interruptions triggering unexpectedly.
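A minimal sketch of that quick fix, assuming a recent Pipecat where the Silero analyzer takes a VADParams object (import paths have moved between versions, so check yours):

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# Lower start_secs so shorter bursts of speech trigger the VAD. The trade-off
# is more false positives, e.g. spurious interruptions. Pass the analyzer to
# your transport's params as usual.
vad_analyzer = SileroVADAnalyzer(params=VADParams(start_secs=0.15))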

To resolve this issue, we want to experiment with different solutions to ensure that interruptions are triggered only by speech, and that short utterances are transcribed and reach the LLM as input.

@nikcaryo-super

Could the approach here resolve this for Deepgram, at least?

#455 (comment)

If STT is picking up an utterance and finalizing it, but the VAD isn't, then we should just use Deepgram's is_final flag as the trigger to say something back. In my experience, Deepgram can be a little slower than expected on some of these yes/no responses, but perhaps there's a way to temporarily lower the VAD sensitivity while the latest Deepgram event is an interim transcript, or something similar. Just want to make sure we're using all the tools we have here.
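A rough, untested sketch of that idea: a processor that brackets a final transcription with synthetic speaking frames when the VAD never fired. The class name is made up; the frame types are Pipecat's, but the FrameProcessor API should be checked against your version:

from pipecat.frames.frames import (
    Frame,
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptionVADFallback(FrameProcessor):
    # If a final transcription arrives while the VAD never fired, wrap it in
    # synthetic speaking frames so downstream aggregators still run.

    def __init__(self):
        super().__init__()
        self._vad_active = False

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, UserStartedSpeakingFrame):
            self._vad_active = True
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self._vad_active = False
        elif isinstance(frame, TranscriptionFrame) and not self._vad_active:
            # The VAD missed this utterance: bracket the transcription ourselves.
            await self.push_frame(UserStartedSpeakingFrame(), direction)
            await self.push_frame(frame, direction)
            await self.push_frame(UserStoppedSpeakingFrame(), direction)
            return

        await self.push_frame(frame, direction)

One caveat: synthesizing UserStartedSpeakingFrame may itself trigger interruption handling, which is exactly the side effect later comments in this thread try to avoid.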

@markbackman (Contributor, Author)

@nikcaryo-super, we've talked about two possible solutions to this problem, one of which is to use the VAD and TranscriptionFrames in combination, as you're suggesting. This is a high-priority issue, and I expect we'll start working on it soon.

@chadbailey59 (Contributor)

I was going to open a separate issue to suggest that Silero VAD's training might not account for the frequency filtering of PSTN phone audio (the voice channel is band-limited to roughly 300-3400 Hz). I'll just mention it here as part of an overall effort to improve VAD performance, and/or work around it with transcriptions.
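A quick way to test that hypothesis (a sketch, not Pipecat code: it just band-limits clean speech to a PSTN-like channel so you can compare Silero confidences before and after):

import numpy as np
from scipy.signal import butter, sosfilt

def pstn_filter(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    # A 300-3400 Hz band-pass approximates the PSTN voice channel.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, samples)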

@balalofernandez (Contributor) commented Jan 22, 2025

I hope this helps as a temporary solution:

  • We use Deepgram's VAD and register these two event handlers:
# Assumed imports for this snippet (UtteranceEndFrame is a custom frame; see below):
from loguru import logger
from pipecat.frames.frames import UserStartedSpeakingFrame

async def on_utterance_end(self, stt, *args, **kwargs):
    logger.info(f"Utterance ended with stt: {stt} and args: {args} and kwargs: {kwargs}")
    await self.task.queue_frames([UtteranceEndFrame()])

async def on_speech_started(self, stt, *args, **kwargs):
    # We don't want to interrupt the user when they start speaking, so no
    # BotInterruptionFrame() is queued; Silero VAD handles interruptions.
    # Here we just want to aggregate the TextFrames.
    logger.info("User started speaking from STT")
    await self.task.queue_frames([UserStartedSpeakingFrame()])

and then in the UserContextAggregator we check whether the interaction was short (fewer than 2 words):

async def process_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
    # A short utterance (0 or 1 words aggregated) just ended per Deepgram's
    # UtteranceEnd event: synthesize the stop frames the VAD never produced.
    if isinstance(frame, UtteranceEndFrame) and len(self._aggregation.split()) < 2:
        await self.process_frame(UserStoppedSpeakingFrame(), direction=FrameDirection.DOWNSTREAM)
        # We might want to queue some interruption frames to interrupt the AI
        await self.process_frame(StopInterruptionFrame())
    await super().process_frame(frame, direction)

Note that UtteranceEndFrame is a new frame we created to distinguish a UserStoppedSpeakingFrame sent by the VAD from one that comes from these handlers.
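For completeness, the custom frame could be as simple as this (a sketch; subclassing SystemFrame is an assumption, since the original definition isn't shown):

from dataclasses import dataclass

from pipecat.frames.frames import SystemFrame

@dataclass
class UtteranceEndFrame(SystemFrame):
    """Signals Deepgram's UtteranceEnd event, as opposed to a VAD-driven stop."""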

@jonkatz commented Jan 23, 2025

Hey, I just want to flag that the short-term fix of lowering the VAD's start_secs doesn't seem to be working on my end. I moved it down to 0.01. The transcriber transcribes, but the pipeline still doesn't register that the user has spoken.

@balalofernandez that looks interesting... I don't think I'm expert enough to employ it, however.

@Vaibhav-Lodha (Contributor)

@balalofernandez, I tried this a few weeks back, but the problem is that the speech-started event is not that reliable. Even when there is no audio, it can still fire, causing more unexpected interruptions. Let me know if it works well for you on a larger sample set.

@balalofernandez (Contributor) commented Jan 23, 2025

You're right, we had to build resiliency around that. What about a two-buffer approach: store all the transcripts from the UserStartedSpeakingFrame that comes from Deepgram's VAD until you receive an UtteranceEndFrame, then check whether that buffer actually contains the user's response? Otherwise, just stick with Silero VAD.

We are still testing our solution, though.
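A minimal sketch of that two-buffer idea (untested; names are hypothetical, imports as in the fallback sketch earlier in the thread, plus the custom UtteranceEndFrame):

class TwoBufferAggregator(FrameProcessor):
    def __init__(self):
        super().__init__()
        self._collecting = False
        self._buffer = []  # transcripts seen since Deepgram's VAD fired

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, UserStartedSpeakingFrame):
            # Deepgram's speech-started event: start collecting, but don't
            # trust it yet (it can fire spuriously, as noted above).
            self._collecting = True
            self._buffer = []
        elif isinstance(frame, TranscriptionFrame) and self._collecting:
            self._buffer.append(frame.text)
        elif isinstance(frame, UtteranceEndFrame):
            self._collecting = False
            if any(t.strip() for t in self._buffer):
                # Deepgram actually heard words: emit a synthetic stop so the
                # short utterance is flushed downstream to the LLM.
                await self.push_frame(UserStoppedSpeakingFrame(), direction)
            # Otherwise do nothing and rely on Silero VAD's own stop signal.

        await self.push_frame(frame, direction)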

@jcbjoe (Contributor) commented Jan 23, 2025

I've been speaking with @markbackman over Discord about this issue. I have a solution I'm working on, but unfortunately it only works for Twilio. Twilio sends us "mark" events when audio has finished playing, and my implementation uses these to figure out when the bot has stopped speaking. For how I'm using Pipecat at the moment, I don't have interruptions turned on, so I only take user transcriptions when the bot is not talking. The VAD delay was producing BotStoppedTalking events 0.8s late, and that lag was why we were missing first words.
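For context, a minimal sketch of consuming those Twilio "mark" events on a Media Streams websocket (the message shape follows Twilio's documented protocol; the handler wiring and mark name are hypothetical):

import json

def handle_twilio_message(raw: str, state: dict) -> None:
    msg = json.loads(raw)
    # Twilio echoes a mark back once all audio sent before it has played out.
    if msg.get("event") == "mark" and msg["mark"]["name"] == "bot_speech_end":
        state["bot_speaking"] = False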

As I also mentioned on Discord, I'm not sure VAD is a foolproof solution, because there is always a delay: it has to observe sufficient silence to know the bot has stopped speaking. Additionally, it feels a bit wasteful on system resources, even if minimal, to run VAD on audio that we generate ourselves. We know the durations of the audio we are playing, and we know when we stop sending audio to play. If we could measure the latency between sending audio and it actually being played, we could calculate when the bot has stopped speaking.

#1073
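A back-of-the-envelope sketch of that duration-based bookkeeping (hypothetical names; assumes 16-bit mono PCM and a separately measured send-to-playback latency):

import time

class PlaybackClock:
    """Estimate when the bot finishes speaking from the audio we send,
    instead of running VAD over our own output."""

    def __init__(self, sample_rate: int = 16000, latency_secs: float = 0.2):
        self._sample_rate = sample_rate
        self._latency = latency_secs  # measured transport latency (assumed known)
        self._playback_ends_at = 0.0  # monotonic time when queued audio runs out

    def on_audio_sent(self, pcm: bytes) -> None:
        duration = len(pcm) / (2 * self._sample_rate)  # 2 bytes per 16-bit sample
        # Chunks play back to back, so queue this one after whatever is pending.
        start = max(time.monotonic(), self._playback_ends_at)
        self._playback_ends_at = start + duration

    def bot_stopped_speaking(self) -> bool:
        # The last chunk finishes playing latency seconds after our send-side
        # estimate, so shift the deadline by that amount.
        return time.monotonic() >= self._playback_ends_at + self._latency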
