
[C++] Questions Why python and c++ time stamps are different? #533

Open
NathanJHLee opened this issue Aug 28, 2024 · 21 comments
Labels: help wanted (Extra attention is needed)

@NathanJHLee

❓ Questions and Help

Hi Silero team!
When I use silero-vad from Python, the results look good.
But when I use silero-vad from C++, I get quite different results between Python and C++.

I prepared silero-vad 5.1 (pip) and a C++ build (silero-vad-master downloaded on 2024-08-26).

# Test sample file: VoxConverse data
[asr1@k-atc12 cpp]$ sox --i voxconverse_data/dev/audio/afjiv.wav

Input File : 'voxconverse_data/dev/audio/afjiv.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:02:31.25 = 2419968 samples ~ 11343.6 CDDA sectors
File Size : 4.84M
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM

sha256sum ~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx
2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx

# In Python

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
model = load_silero_vad(True)  # True = use the ONNX model
wav = read_audio('/ws/stt/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')
speech_timestamps = get_speech_timestamps(wav, model)
for timestamp in speech_timestamps:
    print(timestamp)
{'start': 84512, 'end': 474592}
{'start': 476192, 'end': 506848}
{'start': 509984, 'end': 548320}
{'start': 554528, 'end': 686048}
{'start': 688672, 'end': 787936}
{'start': 789536, 'end': 826848}
{'start': 829472, 'end': 847328}
{'start': 848928, 'end': 859616}
{'start': 862240, 'end': 1046496}
{'start': 1048096, 'end': 1068000}
{'start': 1071136, 'end': 1341408}
{'start': 1357344, 'end': 1379296}
{'start': 1392160, 'end': 1408992}
{'start': 1418784, 'end': 1427936}
{'start': 1431584, 'end': 1485280}
{'start': 1488928, 'end': 1511904}
{'start': 1520672, 'end': 1569248}
{'start': 1578016, 'end': 1610208}
{'start': 1617440, 'end': 1651168}
{'start': 1653280, 'end': 1675744}
{'start': 1686048, 'end': 1710048}
{'start': 1715232, 'end': 1726432}
{'start': 1730080, 'end': 1751008}
{'start': 1753120, 'end': 1773536}
{'start': 1776160, 'end': 1791968}
{'start': 1795104, 'end': 1813984}
{'start': 1820192, 'end': 1860576}
{'start': 1869344, 'end': 1907680}
{'start': 1909280, 'end': 1959392}
{'start': 1966624, 'end': 1989088}
{'start': 2002976, 'end': 2050016}
{'start': 2055712, 'end': 2077152}
{'start': 2093600, 'end': 2132448}
{'start': 2138656, 'end': 2147808}
{'start': 2169888, 'end': 2211296}
{'start': 2222112, 'end': 2244064}
{'start': 2249760, 'end': 2267616}
{'start': 2271264, 'end': 2302944}
{'start': 2313760, 'end': 2327520}

# In C++ (built from the silero-vad source; I downloaded 'silero-vad-master' on 2024-08-26)
I changed some parameters in 'silero-vad-master/examples/cpp/silero-vad-onnx.cpp':
float Threshold = 0.5,
int min_silence_duration_ms = 100,
int speech_pad_ms = 30,
int min_speech_duration_ms = 250,
# They are taken from '~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/utils_vad.py'

sha256sum "../../src/silero_vad/data/silero_vad.onnx"
2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f

[asr1@k-atc12 cpp]$ ./test
num_channel_ :1
sample_rate_ :16000
bits_per_sample_:16
num_samples :2419968
num_data_size :4839936
{start:00019456,end:00200192}
{start:00202752,end:00258048}
{start:00261120,end:00400384}
{start:00403456,end:00473600}
{start:00477184,end:00506880}
{start:00510976,end:00548864}
{start:00555520,end:00637952}
{start:00642560,end:00686592}
{start:00689152,end:00727552}
{start:00729600,end:00787456}
{start:00790016,end:00826880}
{start:00829952,end:00846848}
{start:00849920,end:00858112}
{start:00863232,end:01068032}
{start:01071616,end:01083904}
{start:01088000,end:01289216}
{start:01295360,end:01311744}
{start:01314816,end:01324032}
{start:01326592,end:01340928}
{start:01357824,end:01378816}
{start:01394688,end:01408512}
{start:01420288,end:01427968}
{start:01432576,end:01484800}
{start:01491456,end:01510912}
{start:01521152,end:01569280}
{start:01578496,end:01609216}
{start:01619456,end:01625088}
{start:01627648,end:01650176}
{start:01655296,end:01676288}
{start:01687040,end:01710080}
{start:01716224,end:01724928}
{start:01731072,end:01750528}
{start:01754112,end:01762304}
{start:01765888,end:01772544}
{start:01777664,end:01790976}
{start:01796608,end:01813504}
{start:01821184,end:01859072}
{start:01873408,end:01906176}
{start:01910272,end:01923072}
{start:01926144,end:01959936}
{start:01967616,end:01989120}
{start:02003968,end:02050048}
{start:02058752,end:02076160}
{start:02094592,end:02114048}
{start:02116608,end:02131968}
{start:02170880,end:02191872}
{start:02195456,end:02211840}
{start:02223104,end:02244096}
{start:02250240,end:02267648}
{start:02272256,end:02303488}
{start:02314752,end:02327552}

I checked the checksums of both ONNX models. They are the same.
Any clues?
Thank you.

@NathanJHLee NathanJHLee added the help wanted Extra attention is needed label Aug 28, 2024
@snakers4
Owner

Hi,

I checked the checksums of both ONNX models. They are the same.

The next logical step would be to compare the raw probabilities output by the python code and c++ code.

If they are the same - then it's post-processing. If not - it's onnx_runtime.

You see, the C++ example is community-contributed; we did not debug it.

@snakers4
Owner

Also, a standard suggestion: plot the probabilities for both implementations side by side with an audio envelope, and perhaps with some marker for the speech segments; that would help debugging.

@NathanJHLee
Author

Oh, I see. I thought your published C++ code was guaranteed to work.
Do you have plans to release official C++ code in the future?

@snakers4
Copy link
Owner

Oh, I see. I thought your published C++ code was guaranteed to work.

All examples are community-generated.
PRs are appreciated to fix bugs.

Do you have plans to release official C++ code in the future?

Not yet.

@smallsheep666

I have found that with the same input (for example, all zeros) after reset_states(), the ONNX model output is different from the PyTorch model's.
The ONNX model's speech prob is 0.044, whereas the PyTorch model's speech prob is 0.012.

@snakers4
Owner

The jit and ONNX models have slightly different input formats; most likely this is the reason.

@smallsheep666

I use the same parameters, but the VAD results can differ quite a lot; the output may contain a few more segments.

@NathanJHLee
Author

@smallsheep666
I will also compare the probs between PyTorch and ONNX and let you know later.
Thank you

@smallsheep666

I have found why the probs differ so much between C++ and PyTorch.
The ONNX model used from PyTorch's wrapper keeps a context of 64 samples at 16 kHz, so the input data is 512+64 samples. But the C++ code only uses the current window (512 samples).
After adding the context samples before calling onnxruntime, the results are the same.
It is a bug in the C++ code. @NathanJHLee

@snakers4
Owner

The ONNX model used from PyTorch's wrapper keeps a context of 64 samples at 16 kHz, so the input data is 512+64 samples. But the C++ code only uses the current window (512 samples).
After adding the context samples before calling onnxruntime, the results are the same.
It is a bug in the C++ code. @NathanJHLee

Looks like a C++ wrapper for a previous version of the model (v4 or v3.1). I believe there was a PR to fix that.

@NathanJHLee
Author

NathanJHLee commented Sep 3, 2024

I compared the output probs from three setups: torch (Python), ONNX (Python), and onnxruntime (C++).
I got the same result as @smallsheep666.
ONNX (C++) has a problem producing the probs.
torch (Python) and ONNX (Python) both show the same probs.

My test is below
silero-vad 5.1
torch 2.4.0

Location: silero-vad-master/src/silero_vad
1. torch (Python)
import sys
import torch
sys.path.append('~/workspace/silero-vad-master/src/silero_vad')

from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = torch.jit.load('data/silero_vad.jit')

audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

2. ONNX (Python)
import sys
import torch
sys.path.append('~/workspace/silero-vad-master/src/silero_vad')

from utils_vad import OnnxWrapper
from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = OnnxWrapper('data/silero_vad.onnx',force_onnx_cpu=True)
audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

3. ONNX (C++)
I followed your instructions (silero-vad-master/examples/cpp/README.md)
and added a std::cout as below:
float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
std::cout << "prob : " << speech_prob << std::endl;
float *stateN = ort_outputs[1].GetTensorMutableData<float>();

Probs from 1 and 2 are identical and shown in the first column; probs from 3 are in the second column.

0.01201203465461731 0.0442627
0.007816523313522339 0.0336125
0.005424141883850098 0.0221236
0.032478004693984985 0.0149333
0.023117244243621826 0.0122732
0.030117541551589966 0.00846022
0.05572396516799927 0.00648624
0.06487590074539185 0.0289339
0.046058326959609985 0.03056
0.039179474115371704 0.0349256
0.030434370040893555 0.0270224
0.027803152799606323 0.0505134
0.01884964108467102 0.0558349
0.012964963912963867 0.0883535
0.014463871717453003 0.0743203
0.0173836350440979 0.0492192
0.014908134937286377 0.0641958
0.010565102100372314 0.238843
0.00588575005531311 0.160749
0.00439077615737915 0.20331624
...

To get the right probs, I wrote C++ code based on libtorch (TorchScript).
I finally got the right result. Now I have to add some code that turns the probs into start/end timestamps.
If I succeed, I will let you know.
Thank you.
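For reference, a minimal libtorch sketch of the per-window probability loop described above; audio decoding is omitted and the model path is illustrative:

#include <torch/script.h>
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    torch::jit::script::Module model = torch::jit::load("data/silero_vad.jit");
    model.eval();

    std::vector<float> samples;  // fill with decoded 16 kHz mono float PCM in [-1, 1]
    const int64_t window = 512;

    for (size_t start = 0; start < samples.size(); start += window) {
        int64_t len = std::min<int64_t>(window, static_cast<int64_t>(samples.size() - start));

        // Zero-pad the last chunk to the full window size, as in the Python loop.
        torch::Tensor chunk = torch::zeros({1, window}, torch::kFloat32);
        chunk.narrow(1, 0, len).copy_(
            torch::from_blob(samples.data() + start, {1, len}, torch::kFloat32));

        std::vector<torch::jit::IValue> inputs{chunk, static_cast<int64_t>(16000)};
        float prob = model.forward(inputs).toTensor().item<float>();
        std::cout << prob << std::endl;
    }
    return 0;
}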

@NathanJHLee
Author

NathanJHLee commented Sep 5, 2024

I found an issue: I couldn't call reset_states on the jit model from C++, and without it the model shows different probs.

For example:

import sys
import torch
sys.path.append('/home/silero-vad-master/src/silero_vad')

from utils_vad import read_audio
from utils_vad import get_speech_timestamps
from utils_vad import init_jit_model

model = torch.jit.load('data/silero_vad.jit')

audio = read_audio('DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

# Try to process the next input wav. I use the same audio one more time for this test.

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

The first and second runs show different probs, even though I use the same input data.
So I added 'model.reset_states()' before retrying in Python.
Then it works fine.

So I also want to use 'model.reset_states()' for silero_vad.jit from C++, but only 'forward()' can be called.

According to 'silero-vad-master/src/silero_vad/utils_vad.py':

    def reset_states(self, batch_size=1):
        self._state = torch.zeros((2, batch_size, 128)).float()
        self._context = torch.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

I don't know how to call 'reset_states' from C++.
So when the first wav file is finished, I try to feed zero-padded input to reset the model, as below.

torch::Tensor chunk = torch::zeros({batch_size, 512}, torch::kFloat32);     //zero-pad
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);    
inputs.push_back(16000); 
torch::Tensor output = model.forward(inputs).toTensor();

But the code above couldn't solve the problem. I found one thing that I don't understand.
When I set sample_rate and window_sample_size to 8000 and 256 respectively, it works as a trick, but the side effect is latency: it needs almost 100 ms for a single inference in my environment. I don't think it's a good idea.
Nevertheless, the model is reset correctly:

torch::Tensor chunk = torch::zeros({batch_size, 256}, torch::kFloat32);     //zero-pad
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);    
inputs.push_back(8000); 
torch::Tensor output = model.forward(inputs).toTensor();

So, I want to know the right way to call 'reset_states'. I searched for a related issue and found one; from its first question:

"On the other hand, the function 'reset_states' of the jit model can't be used in C++ code; can you provide a name so we can use it like this: 'model.get_method('name')()' or 'model.run_method('name')'?"

Thank you.

@snakers4
Owner

snakers4 commented Sep 5, 2024

    def reset_states(self, batch_size=1):
        self._state = torch.zeros((2, batch_size, 128)).float()
        self._context = torch.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

This is a method from the ONNX wrapper, where states are reset manually.
It is better to stick to the ONNX implementation if you cannot run full torchscript in Python.
In Python with jit everything is handled inside of the model.
With ONNX states are to be reset manually as shown in the ONNX wrapper.

It works as a trick, but the side effect is latency: it needs almost 100 ms for a single inference in my environment.

The first inference does not require 100ms. It requires zero state, zero padding and the audio chunk itself.

if not self._last_batch_size:
    self.reset_states(batch_size)
if (self._last_sr) and (self._last_sr != sr):
    self.reset_states(batch_size)
if (self._last_batch_size) and (self._last_batch_size != batch_size):
    self.reset_states(batch_size)

if not len(self._context):
    self._context = torch.zeros(batch_size, context_size)

x = torch.cat([self._context, x], dim=1)
if sr in [8000, 16000]:
    ort_inputs = {'input': x.numpy(), 'state': self._state.numpy(), 'sr': np.array(sr, dtype='int64')}
    ort_outs = self.session.run(None, ort_inputs)
    out, state = ort_outs
    self._state = torch.from_numpy(state)
else:
    raise ValueError()
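In the C++ wrapper the counterpart is simply zeroing the state and context buffers between streams; a minimal sketch, with hypothetical member names:

#include <algorithm>
#include <vector>

struct VadState {
    // RNN state expected by the model, shape (2, batch_size=1, 128),
    // mirroring torch.zeros((2, batch_size, 128)) in the Python wrapper.
    std::vector<float> state = std::vector<float>(2 * 1 * 128, 0.0f);
    // 64-sample context prepended to each 512-sample window at 16 kHz.
    std::vector<float> context = std::vector<float>(64, 0.0f);

    void reset_states() {
        std::fill(state.begin(), state.end(), 0.0f);
        std::fill(context.begin(), context.end(), 0.0f);
    }
};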

@NathanJHLee
Author

Oh sorry, I missed that method on your jit model.

model.run_method("reset_states");

It works fine for me. The model is reset correctly.
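For completeness, a minimal sketch of that call from C++; the model path is illustrative:

#include <torch/script.h>

int main() {
    torch::jit::script::Module model = torch::jit::load("data/silero_vad.jit");

    // ... run per-window inference on the first file ...

    // Clear the internal state (and the 64-sample context) before starting an
    // independent stream, so it begins from the same conditions as the first.
    model.run_method("reset_states");

    // ... run per-window inference on the next file ...
    return 0;
}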

@huxiaoyuqn

In order to be consistent with Python, I added the following at the beginning of the predict function in C++:

std::vector<float> new_data(data.size() + 64, 0.0f);
std::copy(data.begin(), data.end(), new_data.begin() + 64);
input.assign(new_data.begin(), new_data.end());

But the speech_prob and the timestamps are still different from Python's.

@huxiaoyuqn

I debugged carefully and found three differences in detail between the C++ code and the Python code:

  1. As mentioned before, C++ does not prepend 64 elements to the input data in predict().
  2. The second dimension of input_node_dims should be extended by 64 elements when it is constructed; otherwise the last 64 elements of the data are not considered when initializing input_ort.
  3. The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous data, 64 zeros are prepended).

Based on the above three points, I designed two vectors:

  1. A new member variable temp_data that carries over the last 64 elements of the previous data;
  2. A new new_data in predict(), whose first 64 elements are copied from temp_data and whose remaining elements come from data; it replaces input as the input of input_ort. A sketch follows below.

If you understand the above points, you can modify the C++ code so that its detection results are consistent with Python's.
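A minimal sketch of that change, assuming a 512-sample window at 16 kHz and using the temp_data / new_data names described above:

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kContext = 64;   // context samples at 16 kHz
constexpr std::size_t kWindow  = 512;  // window samples at 16 kHz

// Member variable in the real wrapper; carries the last 64 samples of the
// previous window and is zero-initialized at the start of a stream (and on reset).
static std::vector<float> temp_data(kContext, 0.0f);

// Builds the 576-sample model input for one 512-sample window (data.size() == kWindow).
// Note: the second dimension of input_node_dims must also be kContext + kWindow,
// otherwise the last 64 samples are dropped when input_ort is created.
std::vector<float> build_input(const std::vector<float>& data) {
    std::vector<float> new_data(kContext + kWindow, 0.0f);
    std::copy(temp_data.begin(), temp_data.end(), new_data.begin());
    std::copy(data.begin(), data.end(), new_data.begin() + kContext);

    // Keep the last 64 samples of the current window for the next call.
    std::copy(data.end() - kContext, data.end(), temp_data.begin());
    return new_data;
}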
Finally, I still have a confusing point:
the function reset_states() in the Python code may perform a different operation on the last loop prediction. This does not seem to be considered in the C++ code, which may make the value of speech_prob differ when predicting the last chunk.
My English is not very good; please forgive the translation.

@snakers4
Owner

snakers4 commented Sep 7, 2024

As mentioned before, C++ does not prepend 64 elements to the input data in predict().

The second dimension of input_node_dims should be extended by 64 elements when it is constructed; otherwise the last 64 elements of the data are not considered when initializing input_ort.

The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous data, 64 zeros are prepended).

This should be done for a v5 model. I wonder whether the C++ wrapper that was supposedly adapted for v5 (in this PR #482) has this change, or whether some earlier version is being discussed.

In any case, a v5 model simply would not work without these features.

the function reset_states() in the Python code may perform a different operation on the last loop prediction. This does not seem to be considered in the C++ code, which may make the value of speech_prob differ when predicting the last chunk.

This function resets state in two places: the state "inside" of the model (since with ONNX we cannot keep it inside the model, we drag it along through the interface as self._state), and it zeroes out the last 64 context samples (self._context).

def reset_states(self, batch_size=1):
    self._state = torch.zeros((2, batch_size, 128)).float()
    self._context = torch.zeros(0)
    self._last_sr = 0
    self._last_batch_size = 0

This function or its counterpart in the C++ code should be invoked:

  • At the start of the VAD session or alternatively at the end of the VAD session;
  • Or when there is any discontinuity in the data, e.g. track change / microphone change / source change / file change / channel change;

@NathanJHLee
Author

Hi @snakers4
Do you have plans to release a half-precision jit model? That means a quantized model, right?

@snakers4
Owner

Hi @snakers4 Do you have plans to release a half-precision jit model? That means a quantized model, right?

We used to have quantized models a long time ago, but there were many complaints that they did not run on some platforms. So we decided not to bother anymore, since the models are small.

@NathanJHLee
Author

NathanJHLee commented Sep 24, 2024

Thank you for your answer.
Now I have tested batch inference, but I get different probs from the Silero model.
I think the model keeps some cache (state) that is reused for the next upcoming chunk.
Even though the probs still need to be checked, batch inference shows much shorter latency than single inference, so I think it's necessary.
I found in your documentation that batch inference was possible with Silero v3.
Does v5 also support batch inference? If yes, I would like to take a closer look.

@github-staff github-staff deleted a comment from Superstar-IT Oct 1, 2024
@snakers4 snakers4 changed the title ❓ Questions Why python and c++ time stamps are different? [C++] Questions Why python and c++ time stamps are different? Oct 11, 2024
@NathanJHLee
Author

Hi snakers4! Please check my PR below.
#578

Thank you.
