-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to add a new decoder after gpt is created with ::new call ? #15
Comments
Good question. I have done this manually but it's tricky. You have to dis-allow loading tensors coming after the last layer in your new model, and this way you can migrate your old weights to your new model. Computation is BTreeMap because not all tensors are a results of a computation (E.g input or parameter tensors), and we also need to process them in a sorted way |
Yes. saw that :) |
@cutoken |
It might start from step 0 but will the weights, biases of layer 1 and other intermediate components from training run stay as is or will they be reset to random ? |
new layers are random. old layers keep old weights |
Got it. Now one more ask on similar angle. Is there a way to not run backward pass on a particular layer ? Since they are so costly, I want to exclude earlier layers from the training. Ideal would be some kind of layer numbers I can pass to the optimizer so that it just ignores the computation for those layers. |
You can't do it on a "particular layer". But you can do it only from the last-layer to a particular layer. ( |
Ya that is what I actually want. For example, I have started the training with 2 layers, trained them to death :D (loss not reducing anymore), restarted my training with 4 layers, now I want original layers 0,1 not to be having backward pass while new layers 2, 3 (and their sub components like self attention heads etc) to be trained normally. How do I achieve something like that ? |
@cutoken Pushed something that is useful for you. But be aware, this might confuse the Adam optimizer (Since the gradients of other layers will be zero) |
saw your commit. But is the computations matching the layers ? 🤔 I mean the number of computations done in backward pass is always equal to the number of decoder layers ? Only if they are the commit would work I think @keyvank |
Okay that zero grad problem can solved if we just clone the last layer. It's as good as any random value anyway :) |
@cutoken No, it's not. you have to calculate how many computations are done from your new layers till the last computation |
I think this is the formula:
where |
Got it. I'm thinking of just storing of the layer number in computation during the creation. That way no need for any calculation and it would reliably work. So call function would take the layer number and layer type. If layer type is decoder and layer number is < start layer/limit we do nothing. |
Hmmmm this could work, but I want this library to be general purpose, what you are proposing needs adding GPT logic into the Graph logic. You can just forget about the calcs and use a number like 200, it doesn't need to be accurate hah |
Yes. This doesn't make much sense unless we expose this as a functionality in the library. For now if I get it working I'll keep it as a add-on in my branch. My guess is that fine tuning would be faster with frozen layers but I wouldn't know unless I try the experiments :D . I'll keep you posted on the results. If it is indeed faster we can consider it in the main branch. |
Update on experimenting with this one:
|
hi @keyvank ,
let's say I want to add a new decoder layer (the one that gets constructed as part of 0..num_layers loop) at run time after the gpt::new() call, how do I go about it ? As I understand you are just pushing the computations one by one incremented by tensorid so adding a layer at a later point of time will need incrementing the ids for the next layers as well (for example adding one more decoder layer along with all the sub layers like attention etc means incrementing the vocab out and other variables outside the for loop ?)
Also why keep computations as a btree when in reality it's being used more like a Vec as we are not even using the id against which each computation is stored (please correct me if I missed something :) )
The text was updated successfully, but these errors were encountered: