Training and Testing code #3
Hi, if you just want to train the model using the face landmarks and the audio, it's fairly simple to create a dataloader. You just need 2 audios + the landmarks corresponding to the target face. The audios are normalized w.r.t. the absolute max. The core network is the one called
This will create the mixtures for you.
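A minimal sketch of such a dataloader, assuming the standard PyTorch `Dataset` API; the class name, the sample layout, and the landmark shape below are illustrative assumptions, not the repository's actual code:

```python
import torch
from torch.utils.data import Dataset

class LandmarkAudioDataset(Dataset):
    """Hypothetical dataset: a target audio, an interfering audio,
    and the face landmarks of the target speaker."""

    def __init__(self, samples):
        # `samples` is assumed to be a list of dicts with pre-loaded tensors:
        # {'audio': (T,), 'other_audio': (T,), 'landmarks': (frames, 68, 2)}
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    @staticmethod
    def normalize(wav: torch.Tensor) -> torch.Tensor:
        # normalize w.r.t. the absolute maximum, as described above
        return wav / wav.abs().max().clamp(min=1e-8)

    def __getitem__(self, idx):
        s = self.samples[idx]
        return (self.normalize(s['audio']),
                self.normalize(s['other_audio']),
                s['landmarks'])
```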
I can give you parts of the code, but I don't think it's going to be any clearer than this one (which has been polished). The refinement network is trained once the main stage is trained, as described in the paper. Best
Hi @JuanFMontesinos, thank you for the code. Great paper! I am indeed trying to replicate the training. I had a few questions and would be grateful if you could please answer them:
Do you have a different preprocess_audio while training, or am I missing something here?
Thank you!
Hi @Sreyan88, the mix is created here:
We are basically computing a binary mask from random coefficients and permuting the elements of the batch.
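A rough sketch of what such on-the-fly mixing could look like; the function name, shapes, and the 0.5 mixing probability are assumptions for illustration, not the repository's actual code:

```python
import torch

def create_mixtures(audio: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: mix each sample with another element of the batch.

    `audio` is a batch of waveforms of shape (B, T). A random binary mask
    decides which samples get an interfering source added, and a permutation
    of the batch indices decides which sample acts as that interference.
    """
    B = audio.shape[0]
    perm = torch.randperm(B)              # permute the elements of the batch
    mask = (torch.rand(B) > 0.5).float()  # binary mask from random coefficients
    # broadcast the mask over the time dimension and add the permuted batch
    return audio + mask.unsqueeze(-1) * audio[perm]
```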
Hi @JuanFMontesinos, thank you so much for this! I will try to keep this discussion alive until I am able to replicate the training!
Sure @Sreyan88
Hi @JuanFMontesinos, I have two questions and would be grateful if you could answer them! (1) Can you please provide me with the code for complex_division()?
Hi @Sreyan88, complex_division is defined in VoViT/vovit/core/models/production_model.py, lines 68 to 73 (commit dfd11c2).
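For readers without access to that permalink: complex division on tensors whose last dimension stacks the (real, imaginary) parts is standard; below is a sketch consistent with that definition. The exact tensor layout used in the repository is an assumption here.

```python
import torch

def complex_division(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Complex division for tensors whose last dim holds (real, imag).

    (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
    """
    a, b = x[..., 0], x[..., 1]
    c, d = y[..., 0], y[..., 1]
    denom = c ** 2 + d ** 2
    real = (a * c + b * d) / denom
    imag = (b * c - a * d) / denom
    return torch.stack([real, imag], dim=-1)
```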
Regarding the masks: masks are usually defined as mask = S_target / S_mix. In our case we divide by n_sources to remove that scaling factor, so that our masks are defined as mask = S_target / (n_sources * S_mix). The interpretation is the following: if you do not divide by the scaling factor, you are making the network estimate how many sources there are and compensate to recover the original loudness. If you divide by the scaling factor, the network just isolates the source at its current loudness. Since we know the number of sources, we just proceeded as in the second case. However, it's something very silly and picky; you shouldn't worry about it.
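A sketch of that mask definition, assuming complex-valued spectrograms; the function and argument names are illustrative:

```python
import torch

def complex_ratio_mask(target_spec: torch.Tensor,
                       mix_spec: torch.Tensor,
                       n_sources: int) -> torch.Tensor:
    """Hypothetical sketch of the mask discussed above.

    mask = S_target / (n_sources * S_mix): with the 1/n_sources factor the
    network only isolates the target at its loudness within the mix, instead
    of also compensating for the number of sources.
    """
    # in practice a small epsilon may be needed to avoid division by zero
    return target_spec / (n_sources * mix_spec)
```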
Thank you so much for your reply! Can you also please help me with your STFT parameters?
As stated in Sec. 3.2 of the paper, audio is resampled to 16384 Hz and the STFT is computed with a hop size of 256 and an n_fft of 1022.
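Putting those parameters together; note the Hann window and the use of torchaudio for resampling are assumptions, as the paper only specifies the sample rate, hop size, and n_fft:

```python
import torch
import torchaudio

SR = 16384     # target sample rate (Sec. 3.2)
N_FFT = 1022   # gives 1022 // 2 + 1 = 512 frequency bins
HOP = 256      # hop size

def compute_spectrogram(wav: torch.Tensor, orig_sr: int) -> torch.Tensor:
    # resample the waveform to 16384 Hz
    wav = torchaudio.functional.resample(wav, orig_freq=orig_sr, new_freq=SR)
    # returns a complex tensor of shape (512, frames)
    return torch.stft(wav, n_fft=N_FFT, hop_length=HOP,
                      window=torch.hann_window(N_FFT),
                      return_complex=True)
```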
Can you please provide the code for the training and testing process? I am also wondering about the detailed training settings, such as the batch size. Thanks a lot!