You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Dec 28, 2023. It is now read-only.
I am wondering about lines 111 to 116 in ppo.cpp. Does it make sense to have mean() for surr_1 and surr_2 here? I was under the impression that surrogate losses should still be per sample at this point, and the mean() is done only after the min at line 124?
auto surr_1 = ratio * mini_batch.advantages.mean();
auto surr_2 = (torch::clamp(ratio,
1.0 - clip_param,
1.0 + clip_param) *
mini_batch.advantages)
.mean();
auto action_loss = -torch::min(surr_1, surr_2).mean();
Also, the way you compute advantages works but might be suboptimal:
advantages = (advantages - advantages.mean() / (advantages.std() + 1e-5));
should probably be
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-5);
The text was updated successfully, but these errors were encountered:
The advantages one is definitely a typo, thanks for catching that.
I'm 99% sure the surrogate means one is also a typo. It seems to have been introduced in a451ad7, and I'm pretty sure it was leftover debug code that I didn't remove before commiting. I'm interested in why it works at all though. I didn't notice any regression in my game that's using the library.
While I was looking at this, I also found #15. And come to think of it the surrogate mean thing is a likely cause of that.
I'll do some tests and figure out #15 first, then if this turns out to be unrelated I'll fix it separately.
Looks like this was a major contributor to #15, but I feel like it's still not performing quite as well as it used to. I'll run some more detailed tests when I get time.
I am wondering about lines 111 to 116 in ppo.cpp. Does it make sense to have mean() for surr_1 and surr_2 here? I was under the impression that surrogate losses should still be per sample at this point, and the mean() is done only after the min at line 124?
auto surr_1 = ratio * mini_batch.advantages.mean();
auto surr_2 = (torch::clamp(ratio,
1.0 - clip_param,
1.0 + clip_param) *
mini_batch.advantages)
.mean();
auto action_loss = -torch::min(surr_1, surr_2).mean();
Also, the way you compute advantages works but might be suboptimal:
advantages = (advantages - advantages.mean() / (advantages.std() + 1e-5));
should probably be
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-5);
The text was updated successfully, but these errors were encountered: