update hparams #18

Open · wants to merge 1 commit into master
Conversation


@fcuni commented Jul 29, 2024

Contribution to TODO

tune the hyperparameters so they are not terrible, I just winged it. (currently seeing val loss 2.06, recall count-based 4-gram was 2.11)

These choices improve the validation loss to ~2.05. I swept over {learning_rate, weight_decay, batch_size, embedding_size, hidden_size} and found this combination to work best. See below for a plot of the current baseline vs the new setup.
[plot: validation loss, current baseline vs new hyperparameter setup]
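For illustration, a sweep over these hyperparameters can be expressed in a few lines. This is a rough sketch, not the exact code used for the runs above; `train_and_eval` is a hypothetical helper that trains a model with the given config and returns its validation loss, and the candidate values are placeholders:

```python
import random

# Hypothetical sketch: random search over the swept hyperparameters.
# train_and_eval(config) stands in for the repo's actual training loop.
search_space = {
    "learning_rate":  [1e-3, 3e-3, 1e-2, 3e-2],
    "weight_decay":   [0.0, 1e-4, 1e-3, 1e-2],
    "batch_size":     [64, 128, 256, 512],
    "embedding_size": [24, 48, 96],
    "hidden_size":    [128, 256, 512],
}

best = None
for trial in range(50):
    config = {k: random.choice(v) for k, v in search_space.items()}
    val_loss = train_and_eval(config)  # hypothetical helper
    if best is None or val_loss < best[0]:
        best = (val_loss, config)

print("best val loss:", best[0], "with config:", best[1])
```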

Link to wandb workspace

PS: if desired, I can also provide the few lines of code needed to log to WandB in a separate PR.
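Those few lines would look roughly like the following. This is a sketch only: the project name and config values are placeholders, and `step`/`val_loss` are assumed to come from the existing training loop:

```python
import wandb

# Placeholder project name and config; substitute the real hyperparameters.
wandb.init(project="hparam-sweep", config={"learning_rate": 1e-2, "batch_size": 128})

# inside the existing training loop, after each evaluation:
wandb.log({"step": step, "val_loss": val_loss})

# once training is done:
wandb.finish()
```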

@karpathy
Contributor

Very cool! :)
Although... you basically made the network a lot larger :p is it a lot slower?
It's possible that at this stage we could also introduce Dropout, and then I think you can probably go even a few notches lower.
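For context, introducing Dropout in a PyTorch MLP is a one-line change per layer. A minimal sketch, where the layer sizes below are placeholders rather than the repo's actual configuration:

```python
import torch.nn as nn

# Minimal sketch: dropout inserted between the hidden layer and the output
# head of an MLP. The sizes here are placeholders, not the repo's values.
context_length, embedding_size, hidden_size, vocab_size = 3, 48, 512, 27

mlp = nn.Sequential(
    nn.Linear(context_length * embedding_size, hidden_size),
    nn.Tanh(),
    nn.Dropout(p=0.1),  # randomly zeroes 10% of activations during training
    nn.Linear(hidden_size, vocab_size),
)
# mlp.train() enables dropout; mlp.eval() disables it for validation/sampling.
```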

@karpathy
Contributor

Another thought... I wonder if I should merge it. It's kind of a fun exercise in itself to optimize it :D

@fcuni
Author

fcuni commented Jul 30, 2024

Although... you basically made the network a lot larger :p is it a lot slower?

Yeah, I loaded the PyTorch version onto CUDA, but the numpy version was quite slow; see a plot below of the runtimes (the y-axis is just a dummy variable in this case).

[plot: runtimes of the PyTorch (CUDA) vs numpy versions]

I was annoyed that I could not make the loss lower at the original model size, so I went back over it today and double-checked. Attaching some plots slicing the loss landscape at fixed {embedding_size, hidden_size}, stepping (very coarsely) over {learning_rate, weight_decay, batch_size}. It seems like your "winging it" hparams are great choices at the original model size :)

[plots: loss-landscape slices at batch_size = 64, 128, 256, 512]
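A sketch of how slices like these could be assembled, purely for illustration: it assumes a hypothetical `results` dict mapping (batch_size, learning_rate, weight_decay) to validation loss collected from the runs, and the grid values are placeholders:

```python
import matplotlib.pyplot as plt

# Sketch: one panel per batch size, plotting val loss vs learning rate,
# with one line per weight decay. `results` is a hypothetical dict of
# {(batch_size, learning_rate, weight_decay): val_loss} from the sweep.
batch_sizes = [64, 128, 256, 512]
learning_rates = [1e-3, 3e-3, 1e-2, 3e-2]
weight_decays = [0.0, 1e-4, 1e-3]

fig, axes = plt.subplots(1, len(batch_sizes), figsize=(16, 3), sharey=True)
for ax, bs in zip(axes, batch_sizes):
    for wd in weight_decays:
        losses = [results[(bs, lr, wd)] for lr in learning_rates]
        ax.plot(learning_rates, losses, marker="o", label=f"wd={wd}")
    ax.set_xscale("log")
    ax.set_title(f"batch_size={bs}")
    ax.set_xlabel("learning rate")
axes[0].set_ylabel("val loss")
axes[0].legend()
plt.show()
```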

@fcuni
Author

fcuni commented Jul 30, 2024

It's possible that at this stage we could also introduce Dropout

That would definitely help. I was playing under the constraint of not touching the model/optimiser, to avoid messing up your project :)

If you would like to introduce Dropout, I am happy to help with that as well.

@casinca

casinca commented Jul 31, 2024

I can't get lower than ~2.05-2.06 either, but I kept everything vanilla since it's supposed to be a course, except for what I feel is fine for everyone to tune, so only a grid search over the optimizer and learning rate.

The best/easiest result was to raise momentum as close to the max as possible, so beta1 to 0.999, and that's it.
[plot: result of the grid search with beta1 = 0.999]
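For concreteness, that change is just the `betas` argument on the optimizer. A minimal sketch, assuming an AdamW-style optimizer and placeholder values for everything else (not confirmed from the repo):

```python
import torch

# Sketch: raising beta1 (the first momentum coefficient) close to its max.
# Default AdamW betas are (0.9, 0.999); here beta1 is pushed to 0.999.
optimizer = torch.optim.AdamW(
    model.parameters(),      # `model` assumed from the existing setup
    lr=1e-2,                 # placeholder learning rate
    betas=(0.999, 0.999),
    weight_decay=1e-4,       # placeholder
)
```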

I also tried Optuna but didn't get a much better result.
[plot: Optuna search result]
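For reference, an Optuna version of that search only takes a few lines. This is a sketch: `train_and_eval` is a hypothetical helper wrapping the training loop, and the search ranges are illustrative:

```python
import optuna

# Sketch of an Optuna study minimizing validation loss over lr and beta1.
# train_and_eval(lr, beta1) is a hypothetical helper around the training loop.
def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    beta1 = trial.suggest_float("beta1", 0.8, 0.999)
    return train_and_eval(lr=lr, beta1=beta1)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```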
