Short utterances don't trigger the VAD #984

Open · markbackman opened this issue Jan 14, 2025 · 8 comments

@markbackman (Contributor)

Description

We've received a number of reports that short utterances like "OK", "Yes", and "No" aren't "heard" by the bot. The root cause is that the VAD is not triggered: by default, speech must last at least start_secs (0.2 seconds) for the VAD to fire.

As a quick fix, the VAD's start_secs value can be lowered to 0.15 or even 0.1 seconds. But this may have unintended consequences, like interruptions triggering unexpectedly.
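A minimal sketch of that quick fix, assuming a recent Pipecat where the Silero analyzer takes a VADParams object (import paths have moved between versions, so check yours):

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# Lower start_secs so shorter bursts of speech trigger the VAD. The trade-off
# is more false positives, e.g. spurious interruptions. Pass the analyzer to
# your transport's params as usual.
vad_analyzer = SileroVADAnalyzer(params=VADParams(start_secs=0.15))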

To resolve this issue, we want to experiment with different solutions to ensure that interruptions are triggered only by speech, and that short utterances are transcribed and reach the LLM as input.

@nikcaryo-super

Could the approach here resolve this for Deepgram, at least?

#455 (comment)

If STT is picking up an utterance and finalizing it, but the VAD isn't, then we should just use Deepgram's is_final flag as the trigger to say something back. In my experience, Deepgram can be a little slower than expected on some of these yes/no responses, but perhaps there's a way to temporarily lower the VAD sensitivity while the latest Deepgram event is an interim transcript, or something similar. Just want to make sure we're using all the tools we have here.
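A rough, untested sketch of that idea: a processor that brackets a final transcription with synthetic speaking frames when the VAD never fired. The class name is made up; the frame types are Pipecat's, but the FrameProcessor API should be checked against your version:

from pipecat.frames.frames import (
    Frame,
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptionVADFallback(FrameProcessor):
    # If a final transcription arrives while the VAD never fired, wrap it in
    # synthetic speaking frames so downstream aggregators still run.

    def __init__(self):
        super().__init__()
        self._vad_active = False

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, UserStartedSpeakingFrame):
            self._vad_active = True
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self._vad_active = False
        elif isinstance(frame, TranscriptionFrame) and not self._vad_active:
            # The VAD missed this utterance: bracket the transcription ourselves.
            await self.push_frame(UserStartedSpeakingFrame(), direction)
            await self.push_frame(frame, direction)
            await self.push_frame(UserStoppedSpeakingFrame(), direction)
            return

        await self.push_frame(frame, direction)

One caveat: synthesizing UserStartedSpeakingFrame may itself trigger interruption handling, which is exactly the side effect later comments in this thread try to avoid.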

@markbackman (Contributor, Author)

@nikcaryo-super, we've talked about two possible solutions to this problem, one of which is to use the VAD and TranscriptionFrames in combination, as you're suggesting. This is a high-priority issue, and I expect we'll start working on it soon.

@chadbailey59 (Contributor)

I was going to open a separate issue to suggest that Silero VAD's training might not account for the frequency filtering of PSTN phone audio (the voice channel is band-limited to roughly 300-3400 Hz). I'll just mention it here as part of an overall effort to improve VAD performance, and/or work around it with transcriptions.
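A quick way to test that hypothesis (a sketch, not Pipecat code: it just band-limits clean speech to a PSTN-like channel so you can compare Silero confidences before and after):

import numpy as np
from scipy.signal import butter, sosfilt

def pstn_filter(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    # A 300-3400 Hz band-pass approximates the PSTN voice channel.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, samples)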

@balalofernandez (Contributor) commented Jan 22, 2025

I hope this helps as a temporary solution:

  • We use Deepgram's VAD and register these two event handlers:
# Assumed imports for this snippet (UtteranceEndFrame is a custom frame; see below):
from loguru import logger
from pipecat.frames.frames import UserStartedSpeakingFrame

async def on_utterance_end(self, stt, *args, **kwargs):
    logger.info(f"Utterance ended with stt: {stt} and args: {args} and kwargs: {kwargs}")
    await self.task.queue_frames([UtteranceEndFrame()])

async def on_speech_started(self, stt, *args, **kwargs):
    # We don't want to interrupt the user when they start speaking, so no
    # BotInterruptionFrame() is queued; Silero VAD handles interruptions.
    # Here we just want to aggregate the TextFrames.
    logger.info("User started speaking from STT")
    await self.task.queue_frames([UserStartedSpeakingFrame()])

and then in the UserContextAggregator we check whether the interaction was short (fewer than 2 words):

async def process_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
    # A short utterance (0 or 1 words aggregated) just ended per Deepgram's
    # UtteranceEnd event: synthesize the stop frames the VAD never produced.
    if isinstance(frame, UtteranceEndFrame) and len(self._aggregation.split()) < 2:
        await self.process_frame(UserStoppedSpeakingFrame(), direction=FrameDirection.DOWNSTREAM)
        # We might want to queue some interruption frames to interrupt the AI
        await self.process_frame(StopInterruptionFrame())
    await super().process_frame(frame, direction)

Note that UtteranceEndFrame is a new frame we created to distinguish a UserStoppedSpeakingFrame sent by the VAD from one that comes from these handlers.
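For completeness, the custom frame could be as simple as this (a sketch; subclassing SystemFrame is an assumption, since the original definition isn't shown):

from dataclasses import dataclass

from pipecat.frames.frames import SystemFrame

@dataclass
class UtteranceEndFrame(SystemFrame):
    """Signals Deepgram's UtteranceEnd event, as opposed to a VAD-driven stop."""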

@jonkatz commented Jan 23, 2025

Hey, I just want to flag that the short-term fix of lowering the VAD's start_secs doesn't seem to be working on my end. I moved it down to 0.01. The transcriber transcribes, but the pipeline still doesn't register that the user has spoken.

@balalofernandez that looks interesting... I don't think I'm expert enough to employ it, however.

@Vaibhav-Lodha (Contributor)

@balalofernandez, I tried this a few weeks back, but the problem is that the speech-started event is not that reliable. Even when there is no audio, it can still fire, causing more unexpected interruptions. Let me know if it works well for you on a larger sample set.

@balalofernandez (Contributor) commented Jan 23, 2025

You're right, we had to build resiliency around that. What about a two-buffer approach: store all the transcripts from the UserStartedSpeakingFrame that comes from Deepgram's VAD until you receive an UtteranceEndFrame, then check whether that buffer actually contains the user's response? Otherwise, just stick with Silero VAD.

We are still testing our solution, though.
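A minimal sketch of that two-buffer idea (untested; names are hypothetical, imports as in the fallback sketch earlier in the thread, plus the custom UtteranceEndFrame):

class TwoBufferAggregator(FrameProcessor):
    def __init__(self):
        super().__init__()
        self._collecting = False
        self._buffer = []  # transcripts seen since Deepgram's VAD fired

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, UserStartedSpeakingFrame):
            # Deepgram's speech-started event: start collecting, but don't
            # trust it yet (it can fire spuriously, as noted above).
            self._collecting = True
            self._buffer = []
        elif isinstance(frame, TranscriptionFrame) and self._collecting:
            self._buffer.append(frame.text)
        elif isinstance(frame, UtteranceEndFrame):
            self._collecting = False
            if any(t.strip() for t in self._buffer):
                # Deepgram actually heard words: emit a synthetic stop so the
                # short utterance is flushed downstream to the LLM.
                await self.push_frame(UserStoppedSpeakingFrame(), direction)
            # Otherwise do nothing and rely on Silero VAD's own stop signal.

        await self.push_frame(frame, direction)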

@jcbjoe (Contributor) commented Jan 23, 2025

I've been speaking with @markbackman over Discord about this issue. I have a solution I'm working on, but unfortunately it only works for Twilio. Twilio sends us "mark" events when audio has finished playing, and my implementation uses these to figure out when the bot has stopped speaking. For how I'm using Pipecat at the moment, I don't have interruptions turned on, so I only take user transcriptions when the bot is not talking. The VAD delay was producing BotStoppedTalking events 0.8s late, and that lag was why we were missing first words.
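For context, a minimal sketch of consuming those Twilio "mark" events on a Media Streams websocket (the message shape follows Twilio's documented protocol; the handler wiring and mark name are hypothetical):

import json

def handle_twilio_message(raw: str, state: dict) -> None:
    msg = json.loads(raw)
    # Twilio echoes a mark back once all audio sent before it has played out.
    if msg.get("event") == "mark" and msg["mark"]["name"] == "bot_speech_end":
        state["bot_speaking"] = False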

As I also mentioned on Discord, I'm not sure VAD is a foolproof solution, because there is always a delay: it has to observe sufficient silence to know the bot has stopped speaking. Additionally, it feels a bit wasteful on system resources, even if minimal, to run VAD on audio that we generate ourselves. We know the durations of the audio we are playing, and we know when we stop sending audio to play. If we could measure the latency between sending audio and it actually being played, we could calculate when the bot has stopped speaking.

#1073
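A back-of-the-envelope sketch of that duration-based bookkeeping (hypothetical names; assumes 16-bit mono PCM and a separately measured send-to-playback latency):

import time

class PlaybackClock:
    """Estimate when the bot finishes speaking from the audio we send,
    instead of running VAD over our own output."""

    def __init__(self, sample_rate: int = 16000, latency_secs: float = 0.2):
        self._sample_rate = sample_rate
        self._latency = latency_secs  # measured transport latency (assumed known)
        self._playback_ends_at = 0.0  # monotonic time when queued audio runs out

    def on_audio_sent(self, pcm: bytes) -> None:
        duration = len(pcm) / (2 * self._sample_rate)  # 2 bytes per 16-bit sample
        # Chunks play back to back, so queue this one after whatever is pending.
        start = max(time.monotonic(), self._playback_ends_at)
        self._playback_ends_at = start + duration

    def bot_stopped_speaking(self) -> bool:
        # The last chunk finishes playing latency seconds after our send-side
        # estimate, so shift the deadline by that amount.
        return time.monotonic() >= self._playback_ends_at + self._latency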
