Difficulty In Reproducing The Paper Results (e.g., Table 4, BYOL reported 66%, measured <<60%) #12
Comments
Hi Enrico, many thanks for the swift response! Please see the wandb output for val_acc1 on ImageNet-100 for all 5 checkpoints. As is evident, the last model (task4, the highest) reaches 62% accuracy at the very end of the linear probing. Please see below my offline linear probing parameters, equivalent to yours:
Is the accuracy in the paper just the accuracy of the final model (which I found to be 62%)? It would be great if you could share the checkpoint; then I can debug my evaluation code. It would also be great if you could share the evaluation script. It does not have to be clean, just enough to give the clearest possible idea. Thank you.
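Roughly, my probe follows the standard recipe: freeze the pretrained backbone and train only a linear classifier on its frozen features. A minimal sketch is below (the backbone, feature dimension, loader, and hyperparameters are placeholders for illustration, not my actual settings or the repository's script):

```python
# Minimal offline linear probe sketch (NOT the repo's evaluation script).
# Assumes `backbone` maps images to feature vectors; feat_dim, epochs,
# and lr are placeholder values.
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, feat_dim=512, num_classes=100,
                 epochs=100, lr=0.1, device="cuda"):
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad = False  # freeze the encoder; only the linear head is trained

    classifier = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            with torch.no_grad():
                feats = backbone(images)  # features from the frozen backbone
            loss = criterion(classifier(feats), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```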
Thanks for your response. I will have a look at tuning the parameters further. Thank you.
I found some checkpoints that might be relevant: https://drive.google.com/drive/folders/1gOejzl4Q0cqAcmEjUhyStYPDbXPn1o9R?usp=share_link I am not 100% sure that this is the correct checkpoint, so use it at your own risk. EDIT: this checkpoint was probably obtained with a different version of the code, so you might have issues resuming it.
Yes, your curves look similar to mine. I think it is likely due to hyperparameter tuning of the offline linear eval. Also, always remember that there might be some randomness involved, so a small decrease in performance might be due to that.
Thanks for the model, args, and the info. I will have a look at these!
One last thing that just came to my mind. We recently found that Pillow-SIMD can have a detrimental effect on some models (see the issue here: vturrisi/solo-learn#313). I am not sure whether we used it in our experiments. It might be another thing to check on. EDIT: also make sure you use DALI for pre-training.
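A quick way to check which Pillow is installed (a small sketch added for reference, not from the repo): Pillow-SIMD releases are versioned with a `.postN` suffix, while stock Pillow releases are not.

```python
# Quick check for Pillow-SIMD vs. stock Pillow: Pillow-SIMD versions carry a
# ".postN" suffix (e.g. "7.0.0.post3"); stock Pillow versions do not.
import PIL

print(PIL.__version__)
if "post" in PIL.__version__:
    print("Pillow-SIMD appears to be installed; try re-running with stock Pillow.")
```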
Cool. I was using it, actually. Will try without it and report any difference. |
How much did you get with online linear eval? |
Generally 4-5% below the offline counterpart. |
OK, so around 57%. The checkpoint that I shared should have an online eval accuracy of 58.8%.
Hi,
Thanks a lot for your amazing work and for releasing the code. I have been trying to reproduce your Table 4 for some time. I use the code and the scripts directly, with NO modification.
For example, in this table, the performance of BYOL fine-tuning on ImageNet-100 in the 5-task class-incremental setting is reported as 66.0. Instead, I measured well below 60.0, at least 6% lower. Please see the full results table below if interested (a 5 x 5 table).
results.pdf
Any idea what may be causing the gap? Are there any nuances in the evaluation method? For example, for average accuracy, I simply take the mean of the attached table across all rows and columns (as also suggested by GEM, which you referenced).
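For concreteness, the averaging I mean is sketched below on a placeholder matrix; I also print the GEM-style average accuracy (mean of the final row only), in case the paper uses that convention instead. The values here are random, purely for illustration.

```python
# Two averaging conventions on a placeholder 5 x 5 results matrix, where
# acc[i, j] = accuracy on task j after training on task i (random values here).
import numpy as np

acc = np.random.rand(5, 5) * 100

print("mean over all entries:", acc.mean())       # the averaging described above
print("mean of the final row:", acc[-1].mean())   # GEM-style average accuracy
```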
Thanks a lot again for your response and your eye-opening work.