update hparams #18

Open · wants to merge 1 commit into master
Conversation


@fcuni commented Jul 29, 2024

Contribution to TODO

tune the hyperparameters so they are not terrible, I just winged it. (currently seeing val loss 2.06, recall count-based 4-gram was 2.11)

These choices improve the validation loss to ~2.05. I swept over {learning_rate, weight_decay, batch_size, embedding_size, hidden_size} and found this combination to work best. See below for a plot of the current baseline vs the new setup.
[plot: validation loss, current baseline vs new hyperparameter setup]
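For illustration, a sweep over these hyperparameters can be expressed in a few lines. This is a rough sketch, not the exact code used for the runs above; `train_and_eval` is a hypothetical helper that trains a model with the given config and returns its validation loss, and the candidate values are placeholders:

```python
import random

# Hypothetical sketch: random search over the swept hyperparameters.
# train_and_eval(config) stands in for the repo's actual training loop.
search_space = {
    "learning_rate":  [1e-3, 3e-3, 1e-2, 3e-2],
    "weight_decay":   [0.0, 1e-4, 1e-3, 1e-2],
    "batch_size":     [64, 128, 256, 512],
    "embedding_size": [24, 48, 96],
    "hidden_size":    [128, 256, 512],
}

best = None
for trial in range(50):
    config = {k: random.choice(v) for k, v in search_space.items()}
    val_loss = train_and_eval(config)  # hypothetical helper
    if best is None or val_loss < best[0]:
        best = (val_loss, config)

print("best val loss:", best[0], "with config:", best[1])
```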

Link to wandb workspace

PS: if desired, I can also provide the few lines of code needed to log to WandB in a separate PR.
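Those few lines would look roughly like the following. This is a sketch only: the project name and config values are placeholders, and `step`/`val_loss` are assumed to come from the existing training loop:

```python
import wandb

# Placeholder project name and config; substitute the real hyperparameters.
wandb.init(project="hparam-sweep", config={"learning_rate": 1e-2, "batch_size": 128})

# inside the existing training loop, after each evaluation:
wandb.log({"step": step, "val_loss": val_loss})

# once training is done:
wandb.finish()
```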

@karpathy
Contributor

Very cool! :)
Although... you basically made the network a lot larger :p is it a lot slower?
It's possible that at this stage we could also introduce Dropout, and then I think you can probably go even a few notches lower.
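For context, introducing Dropout in a PyTorch MLP is a one-line change per layer. A minimal sketch, where the layer sizes below are placeholders rather than the repo's actual configuration:

```python
import torch.nn as nn

# Minimal sketch: dropout inserted between the hidden layer and the output
# head of an MLP. The sizes here are placeholders, not the repo's values.
context_length, embedding_size, hidden_size, vocab_size = 3, 48, 512, 27

mlp = nn.Sequential(
    nn.Linear(context_length * embedding_size, hidden_size),
    nn.Tanh(),
    nn.Dropout(p=0.1),  # randomly zeroes 10% of activations during training
    nn.Linear(hidden_size, vocab_size),
)
# mlp.train() enables dropout; mlp.eval() disables it for validation/sampling.
```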

@karpathy
Contributor

Another thought... I wonder if I should merge it. It's kind of a fun exercise in itself to optimize it :D

@fcuni
Author

fcuni commented Jul 30, 2024

Although... you basically made the network a lot larger :p is it a lot slower?

Yeah, I loaded the PyTorch version onto CUDA, but the numpy version was quite slow; see a plot below of the runtimes (the y-axis is just a dummy variable in this case).

[plot: runtimes of the PyTorch (CUDA) vs numpy versions]

I was annoyed that I could not make the loss lower at the original model size, so I went back over it today and double-checked. Attaching some plots slicing the loss landscape at fixed {embedding_size, hidden_size}, stepping (very coarsely) over {learning_rate, weight_decay, batch_size}. It seems like your "winging it" hparams are great choices at the original model size :)

[plots: loss-landscape slices at batch_size = 64, 128, 256, 512]
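A sketch of how slices like these could be assembled, purely for illustration: it assumes a hypothetical `results` dict mapping (batch_size, learning_rate, weight_decay) to validation loss collected from the runs, and the grid values are placeholders:

```python
import matplotlib.pyplot as plt

# Sketch: one panel per batch size, plotting val loss vs learning rate,
# with one line per weight decay. `results` is a hypothetical dict of
# {(batch_size, learning_rate, weight_decay): val_loss} from the sweep.
batch_sizes = [64, 128, 256, 512]
learning_rates = [1e-3, 3e-3, 1e-2, 3e-2]
weight_decays = [0.0, 1e-4, 1e-3]

fig, axes = plt.subplots(1, len(batch_sizes), figsize=(16, 3), sharey=True)
for ax, bs in zip(axes, batch_sizes):
    for wd in weight_decays:
        losses = [results[(bs, lr, wd)] for lr in learning_rates]
        ax.plot(learning_rates, losses, marker="o", label=f"wd={wd}")
    ax.set_xscale("log")
    ax.set_title(f"batch_size={bs}")
    ax.set_xlabel("learning rate")
axes[0].set_ylabel("val loss")
axes[0].legend()
plt.show()
```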

@fcuni
Author

fcuni commented Jul 30, 2024

It's possible that at this stage we could also introduce Dropout

That would definitely help. I was playing under the constraint of not touching the model/optimiser, to avoid messing up your project :)

If you would like to introduce Dropout, I am happy to help with that as well.

@casinca

casinca commented Jul 31, 2024

I can't get lower than ~2.05-2.06 either, but I kept everything vanilla since it's supposed to be a course, except for what I feel is fine for everyone to tune, so only a grid search over the optimizer and learning rate.

The best/easiest result was to raise momentum as close to the max as possible, so beta1 to 0.999, and that's it.
[plot: result of the grid search with beta1 = 0.999]
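For concreteness, that change is just the `betas` argument on the optimizer. A minimal sketch, assuming an AdamW-style optimizer and placeholder values for everything else (not confirmed from the repo):

```python
import torch

# Sketch: raising beta1 (the first momentum coefficient) close to its max.
# Default AdamW betas are (0.9, 0.999); here beta1 is pushed to 0.999.
optimizer = torch.optim.AdamW(
    model.parameters(),      # `model` assumed from the existing setup
    lr=1e-2,                 # placeholder learning rate
    betas=(0.999, 0.999),
    weight_decay=1e-4,       # placeholder
)
```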

I also tried Optuna but didn't get a much better result.
[plot: Optuna search result]
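For reference, an Optuna version of that search only takes a few lines. This is a sketch: `train_and_eval` is a hypothetical helper wrapping the training loop, and the search ranges are illustrative:

```python
import optuna

# Sketch of an Optuna study minimizing validation loss over lr and beta1.
# train_and_eval(lr, beta1) is a hypothetical helper around the training loop.
def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    beta1 = trial.suggest_float("beta1", 0.8, 0.999)
    return train_and_eval(lr=lr, beta1=beta1)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```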
