
[C++] Questions Why python and c++ time stamps are different? #533

Open
NathanJHLee opened this issue Aug 28, 2024 · 21 comments
Labels: help wanted (Extra attention is needed)

@NathanJHLee

❓ Questions and Help

Hi Silero team!
When I use silero-vad from Python, the results look good.
But when I use silero-vad from C++, I get quite different results between Python and C++.

I prepared silero-vad 5.1 (pip) and a C++ build (silero-vad-master downloaded on 2024-08-26).

# Test sample file: VoxConverse data
[asr1@k-atc12 cpp]$ sox --i voxconverse_data/dev/audio/afjiv.wav

Input File : 'voxconverse_data/dev/audio/afjiv.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:02:31.25 = 2419968 samples ~ 11343.6 CDDA sectors
File Size : 4.84M
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM

sha256sum ~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx
2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx

# In Python

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
model = load_silero_vad(True)  # True = use the ONNX model
wav = read_audio('/ws/stt/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')
speech_timestamps = get_speech_timestamps(wav, model)
for timestamp in speech_timestamps:
    print(timestamp)
{'start': 84512, 'end': 474592}
{'start': 476192, 'end': 506848}
{'start': 509984, 'end': 548320}
{'start': 554528, 'end': 686048}
{'start': 688672, 'end': 787936}
{'start': 789536, 'end': 826848}
{'start': 829472, 'end': 847328}
{'start': 848928, 'end': 859616}
{'start': 862240, 'end': 1046496}
{'start': 1048096, 'end': 1068000}
{'start': 1071136, 'end': 1341408}
{'start': 1357344, 'end': 1379296}
{'start': 1392160, 'end': 1408992}
{'start': 1418784, 'end': 1427936}
{'start': 1431584, 'end': 1485280}
{'start': 1488928, 'end': 1511904}
{'start': 1520672, 'end': 1569248}
{'start': 1578016, 'end': 1610208}
{'start': 1617440, 'end': 1651168}
{'start': 1653280, 'end': 1675744}
{'start': 1686048, 'end': 1710048}
{'start': 1715232, 'end': 1726432}
{'start': 1730080, 'end': 1751008}
{'start': 1753120, 'end': 1773536}
{'start': 1776160, 'end': 1791968}
{'start': 1795104, 'end': 1813984}
{'start': 1820192, 'end': 1860576}
{'start': 1869344, 'end': 1907680}
{'start': 1909280, 'end': 1959392}
{'start': 1966624, 'end': 1989088}
{'start': 2002976, 'end': 2050016}
{'start': 2055712, 'end': 2077152}
{'start': 2093600, 'end': 2132448}
{'start': 2138656, 'end': 2147808}
{'start': 2169888, 'end': 2211296}
{'start': 2222112, 'end': 2244064}
{'start': 2249760, 'end': 2267616}
{'start': 2271264, 'end': 2302944}
{'start': 2313760, 'end': 2327520}

# In C++ (built from the silero-vad source; I downloaded 'silero-vad-master' on 2024-08-26)
I changed some parameters in 'silero-vad-master/examples/cpp/silero-vad-onnx.cpp':
float Threshold = 0.5,
int min_silence_duration_ms = 100,
int speech_pad_ms = 30,
int min_speech_duration_ms = 250,
# They are taken from '~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/utils_vad.py'

sha256sum "../../src/silero_vad/data/silero_vad.onnx"
2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f

[asr1@k-atc12 cpp]$ ./test
num_channel_ :1
sample_rate_ :16000
bits_per_sample_:16
num_samples :2419968
num_data_size :4839936
{start:00019456,end:00200192}
{start:00202752,end:00258048}
{start:00261120,end:00400384}
{start:00403456,end:00473600}
{start:00477184,end:00506880}
{start:00510976,end:00548864}
{start:00555520,end:00637952}
{start:00642560,end:00686592}
{start:00689152,end:00727552}
{start:00729600,end:00787456}
{start:00790016,end:00826880}
{start:00829952,end:00846848}
{start:00849920,end:00858112}
{start:00863232,end:01068032}
{start:01071616,end:01083904}
{start:01088000,end:01289216}
{start:01295360,end:01311744}
{start:01314816,end:01324032}
{start:01326592,end:01340928}
{start:01357824,end:01378816}
{start:01394688,end:01408512}
{start:01420288,end:01427968}
{start:01432576,end:01484800}
{start:01491456,end:01510912}
{start:01521152,end:01569280}
{start:01578496,end:01609216}
{start:01619456,end:01625088}
{start:01627648,end:01650176}
{start:01655296,end:01676288}
{start:01687040,end:01710080}
{start:01716224,end:01724928}
{start:01731072,end:01750528}
{start:01754112,end:01762304}
{start:01765888,end:01772544}
{start:01777664,end:01790976}
{start:01796608,end:01813504}
{start:01821184,end:01859072}
{start:01873408,end:01906176}
{start:01910272,end:01923072}
{start:01926144,end:01959936}
{start:01967616,end:01989120}
{start:02003968,end:02050048}
{start:02058752,end:02076160}
{start:02094592,end:02114048}
{start:02116608,end:02131968}
{start:02170880,end:02191872}
{start:02195456,end:02211840}
{start:02223104,end:02244096}
{start:02250240,end:02267648}
{start:02272256,end:02303488}
{start:02314752,end:02327552}

I checked the checksums of both ONNX models. They are the same.
Any clues?
Thank you.

@NathanJHLee NathanJHLee added the help wanted Extra attention is needed label Aug 28, 2024
@snakers4
Owner

Hi,

I checked the checksums of both ONNX models. They are the same.

The next logical step would be to compare the raw probabilities output by the python code and c++ code.

If they are the same - then it's post-processing. If not - it's onnx_runtime.

You see, the C++ example is community-contributed; we did not debug it.

@snakers4
Owner

Also, a standard suggestion: plot the probabilities for both implementations side by side with an audio envelope, and perhaps with some marker for the speech segments; that would help debugging.

@NathanJHLee
Author

Oh, I see. I thought your published C++ code was guaranteed to work.
Do you have plans to release official C++ code in the future?

@snakers4
Copy link
Owner

Oh, I see. I thought your published C++ code was guaranteed to work.

All examples are community-generated.
PRs are appreciated to fix bugs.

Do you have plans to release official C++ code in the future?

Not yet.

@smallsheep666

I have found that with the same input (for example, all zeros) after reset_states(), the ONNX model output is different from the PyTorch model's.
The ONNX model's speech prob is 0.044, whereas the PyTorch model's speech prob is 0.012.

@snakers4
Owner

The jit and ONNX models have slightly different input formats; most likely this is the reason.

@smallsheep666

I use the same parameters, but the VAD results can differ quite a lot; the output may contain a few more segments.

@NathanJHLee
Author

@smallsheep666
I will also compare the probs between PyTorch and ONNX and let you know later.
Thank you

@smallsheep666

I have found why the probs differ so much between C++ and PyTorch.
The ONNX model used from PyTorch's wrapper keeps a context of 64 samples at 16 kHz, so the input data is 512+64 samples. But the C++ code only uses the current window (512 samples).
After adding the context samples before calling onnxruntime, the results are the same.
It is a bug in the C++ code. @NathanJHLee

@snakers4
Owner

The ONNX model used from PyTorch's wrapper keeps a context of 64 samples at 16 kHz, so the input data is 512+64 samples. But the C++ code only uses the current window (512 samples).
After adding the context samples before calling onnxruntime, the results are the same.
It is a bug in the C++ code. @NathanJHLee

Looks like a C++ wrapper for a previous version of the model (v4 or v3.1). I believe there was a PR to fix that.

@NathanJHLee
Author

NathanJHLee commented Sep 3, 2024

I compared the output probs from three setups: torch (Python), ONNX (Python), and onnxruntime (C++).
I got the same result as @smallsheep666.
ONNX (C++) has a problem producing the probs.
torch (Python) and ONNX (Python) both show the same probs.

My test is below
silero-vad 5.1
torch 2.4.0

Location: silero-vad-master/src/silero_vad
1. torch (Python)
import sys
import torch
sys.path.append('~/workspace/silero-vad-master/src/silero_vad')

from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = torch.jit.load('data/silero_vad.jit')

audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

2. ONNX (Python)
import sys
import torch
sys.path.append('~/workspace/silero-vad-master/src/silero_vad')

from utils_vad import OnnxWrapper
from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = OnnxWrapper('data/silero_vad.onnx',force_onnx_cpu=True)
audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

3. ONNX (C++)
I followed your instructions (silero-vad-master/examples/cpp/README.md)
and added a std::cout as below:
float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
std::cout << "prob : " << speech_prob << std::endl;
float *stateN = ort_outputs[1].GetTensorMutableData<float>();

Probs from 1 and 2 are identical and shown in the first column; probs from 3 are in the second column.

0.01201203465461731 0.0442627
0.007816523313522339 0.0336125
0.005424141883850098 0.0221236
0.032478004693984985 0.0149333
0.023117244243621826 0.0122732
0.030117541551589966 0.00846022
0.05572396516799927 0.00648624
0.06487590074539185 0.0289339
0.046058326959609985 0.03056
0.039179474115371704 0.0349256
0.030434370040893555 0.0270224
0.027803152799606323 0.0505134
0.01884964108467102 0.0558349
0.012964963912963867 0.0883535
0.014463871717453003 0.0743203
0.0173836350440979 0.0492192
0.014908134937286377 0.0641958
0.010565102100372314 0.238843
0.00588575005531311 0.160749
0.00439077615737915 0.20331624
...

To get the right probs, I wrote C++ code based on libtorch (TorchScript).
I finally got the right result. Now I have to add some code that turns the probs into start/end timestamps.
If I succeed, I will let you know.
Thank you.
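For reference, a minimal libtorch sketch of the per-window probability loop described above; audio decoding is omitted and the model path is illustrative:

#include <torch/script.h>
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    torch::jit::script::Module model = torch::jit::load("data/silero_vad.jit");
    model.eval();

    std::vector<float> samples;  // fill with decoded 16 kHz mono float PCM in [-1, 1]
    const int64_t window = 512;

    for (size_t start = 0; start < samples.size(); start += window) {
        int64_t len = std::min<int64_t>(window, static_cast<int64_t>(samples.size() - start));

        // Zero-pad the last chunk to the full window size, as in the Python loop.
        torch::Tensor chunk = torch::zeros({1, window}, torch::kFloat32);
        chunk.narrow(1, 0, len).copy_(
            torch::from_blob(samples.data() + start, {1, len}, torch::kFloat32));

        std::vector<torch::jit::IValue> inputs{chunk, static_cast<int64_t>(16000)};
        float prob = model.forward(inputs).toTensor().item<float>();
        std::cout << prob << std::endl;
    }
    return 0;
}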

@NathanJHLee
Author

NathanJHLee commented Sep 5, 2024

I found an issue: I couldn't call reset_states on the jit model from C++, and without it the model shows different probs.

For example:

import sys
import torch
sys.path.append('/home/silero-vad-master/src/silero_vad')

from utils_vad import read_audio
from utils_vad import get_speech_timestamps
from utils_vad import init_jit_model

model = torch.jit.load('data/silero_vad.jit')

audio = read_audio('DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

# Try to process the next input wav. I use the same audio one more time for this test.

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

The first and second runs show different probs, even though I use the same input data.
So I added 'model.reset_states()' before retrying in Python.
Then it works fine.

So I also want to use 'model.reset_states()' for silero_vad.jit from C++, but only 'forward()' can be called.

According to 'silero-vad-master/src/silero_vad/utils_vad.py':

    def reset_states(self, batch_size=1):
        self._state = torch.zeros((2, batch_size, 128)).float()
        self._context = torch.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

I don't know how to call 'reset_states' from C++.
So when the first wav file is finished, I try to feed zero-padded input to reset the model, as below.

torch::Tensor chunk = torch::zeros({batch_size, 512}, torch::kFloat32);     //zero-pad
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);    
inputs.push_back(16000); 
torch::Tensor output = model.forward(inputs).toTensor();

But the code above couldn't solve the problem. I found one thing that I don't understand.
When I set sample_rate and window_sample_size to 8000 and 256 respectively, it works as a trick, but the side effect is latency: it needs almost 100 ms for a single inference in my environment. I don't think it's a good idea.
Nevertheless, the model is reset correctly:

torch::Tensor chunk = torch::zeros({batch_size, 256}, torch::kFloat32);     //zero-pad
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);    
inputs.push_back(8000); 
torch::Tensor output = model.forward(inputs).toTensor();

So, I want to know the right way to call 'reset_states'. I searched for a related issue and found one; from its first question:

"On the other hand, the function 'reset_states' of the jit model can't be used in C++ code; can you provide a name so we can use it like this: 'model.get_method('name')()' or 'model.run_method('name')'?"

Thank you.

@snakers4
Owner

snakers4 commented Sep 5, 2024

    def reset_states(self, batch_size=1):
        self._state = torch.zeros((2, batch_size, 128)).float()
        self._context = torch.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

This is a method from the ONNX wrapper, where states are reset manually.
It is better to stick to the ONNX implementation if you cannot run full torchscript in Python.
In Python with jit everything is handled inside of the model.
With ONNX states are to be reset manually as shown in the ONNX wrapper.

It works as a trick, but the side effect is latency: it needs almost 100 ms for a single inference in my environment.

The first inference does not require 100ms. It requires zero state, zero padding and the audio chunk itself.

if not self._last_batch_size:
    self.reset_states(batch_size)
if (self._last_sr) and (self._last_sr != sr):
    self.reset_states(batch_size)
if (self._last_batch_size) and (self._last_batch_size != batch_size):
    self.reset_states(batch_size)

if not len(self._context):
    self._context = torch.zeros(batch_size, context_size)

x = torch.cat([self._context, x], dim=1)
if sr in [8000, 16000]:
    ort_inputs = {'input': x.numpy(), 'state': self._state.numpy(), 'sr': np.array(sr, dtype='int64')}
    ort_outs = self.session.run(None, ort_inputs)
    out, state = ort_outs
    self._state = torch.from_numpy(state)
else:
    raise ValueError()
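In the C++ wrapper the counterpart is simply zeroing the state and context buffers between streams; a minimal sketch, with hypothetical member names:

#include <algorithm>
#include <vector>

struct VadState {
    // RNN state expected by the model, shape (2, batch_size=1, 128),
    // mirroring torch.zeros((2, batch_size, 128)) in the Python wrapper.
    std::vector<float> state = std::vector<float>(2 * 1 * 128, 0.0f);
    // 64-sample context prepended to each 512-sample window at 16 kHz.
    std::vector<float> context = std::vector<float>(64, 0.0f);

    void reset_states() {
        std::fill(state.begin(), state.end(), 0.0f);
        std::fill(context.begin(), context.end(), 0.0f);
    }
};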

@NathanJHLee
Author

Oh sorry, I missed that method on your jit model.

model.run_method("reset_states");

It works fine for me. The model is reset correctly.
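For completeness, a minimal sketch of that call from C++; the model path is illustrative:

#include <torch/script.h>

int main() {
    torch::jit::script::Module model = torch::jit::load("data/silero_vad.jit");

    // ... run per-window inference on the first file ...

    // Clear the internal state (and the 64-sample context) before starting an
    // independent stream, so it begins from the same conditions as the first.
    model.run_method("reset_states");

    // ... run per-window inference on the next file ...
    return 0;
}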

@huxiaoyuqn

In order to be consistent with Python, I added the following at the beginning of the predict function in C++:

std::vector<float> new_data(data.size() + 64, 0.0f);
std::copy(data.begin(), data.end(), new_data.begin() + 64);
input.assign(new_data.begin(), new_data.end());

But the speech_prob and the timestamps are still different from Python's.

@huxiaoyuqn

I debugged carefully and found three differences in detail between the C++ code and the Python code:

  1. As mentioned before, C++ does not prepend 64 elements to the input data in predict().
  2. The second dimension of input_node_dims should be extended by 64 elements when it is constructed; otherwise the last 64 elements of the data are not considered when initializing input_ort.
  3. The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous data, 64 zeros are prepended).

Based on the above three points, I designed two vectors:

  1. A new member variable temp_data that carries over the last 64 elements of the previous data;
  2. A new new_data in predict(), whose first 64 elements are copied from temp_data and whose remaining elements come from data; it replaces input as the input of input_ort. A sketch follows below.

If you understand the above points, you can modify the C++ code so that its detection results are consistent with Python's.
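A minimal sketch of that change, assuming a 512-sample window at 16 kHz and using the temp_data / new_data names described above:

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kContext = 64;   // context samples at 16 kHz
constexpr std::size_t kWindow  = 512;  // window samples at 16 kHz

// Member variable in the real wrapper; carries the last 64 samples of the
// previous window and is zero-initialized at the start of a stream (and on reset).
static std::vector<float> temp_data(kContext, 0.0f);

// Builds the 576-sample model input for one 512-sample window (data.size() == kWindow).
// Note: the second dimension of input_node_dims must also be kContext + kWindow,
// otherwise the last 64 samples are dropped when input_ort is created.
std::vector<float> build_input(const std::vector<float>& data) {
    std::vector<float> new_data(kContext + kWindow, 0.0f);
    std::copy(temp_data.begin(), temp_data.end(), new_data.begin());
    std::copy(data.begin(), data.end(), new_data.begin() + kContext);

    // Keep the last 64 samples of the current window for the next call.
    std::copy(data.end() - kContext, data.end(), temp_data.begin());
    return new_data;
}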
Finally, I still have a confusing point:
the function reset_states() in the Python code may perform a different operation on the last loop prediction. This does not seem to be considered in the C++ code, which may make the value of speech_prob differ when predicting the last chunk.
My English is not very good; please forgive the translation.

@snakers4
Owner

snakers4 commented Sep 7, 2024

As mentioned before, C++ does not prepend 64 elements to the input data in predict().

The second dimension of input_node_dims should be extended by 64 elements when it is constructed; otherwise the last 64 elements of the data are not considered when initializing input_ort.

The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous data, 64 zeros are prepended).

This should be done for a v5 model. I wonder whether the C++ wrapper that was supposedly adapted for v5 (in this PR #482) has this change, or whether some earlier version is being discussed.

In any case, a v5 model simply would not work without these features.

the function reset_states() in the Python code may perform a different operation on the last loop prediction. This does not seem to be considered in the C++ code, which may make the value of speech_prob differ when predicting the last chunk.

This function resets state in two places: the state "inside" of the model (since with ONNX we cannot keep it inside the model, we drag it along through the interface as self._state), and it zeroes out the last 64 context samples (self._context).

def reset_states(self, batch_size=1):
    self._state = torch.zeros((2, batch_size, 128)).float()
    self._context = torch.zeros(0)
    self._last_sr = 0
    self._last_batch_size = 0

This function or its counterpart in the C++ code should be invoked:

  • At the start of the VAD session or alternatively at the end of the VAD session;
  • Or when there is any discontinuity in the data, e.g. track change / microphone change / source change / file change / channel change;

@NathanJHLee
Author

Hi @snakers4
Do you have plans to release a half-precision jit model? That means a quantized model, right?

@snakers4
Owner

Hi @snakers4 Do you have plans to release a half-precision jit model? That means a quantized model, right?

We used to have quantized models a long time ago, but there were many complaints that they did not run on some platforms. So we decided not to bother anymore, since the models are small.

@NathanJHLee
Author

NathanJHLee commented Sep 24, 2024

Thank you for your answer.
Now I have tested batch inference, but I get different probs from the Silero model.
I think the model keeps some cache (state) that is reused for the next upcoming chunk.
Even though the probs still need to be checked, batch inference shows much shorter latency than single inference, so I think it's necessary.
I found in your documentation that batch inference was possible with Silero v3.
Does v5 also support batch inference? If yes, I would like to take a closer look.

@github-staff github-staff deleted a comment from Superstar-IT Oct 1, 2024
@snakers4 snakers4 changed the title ❓ Questions Why python and c++ time stamps are different? [C++] Questions Why python and c++ time stamps are different? Oct 11, 2024
@NathanJHLee
Author

Hi snakers4! Please check my PR below.
#578

Thank you.
